Best setup local LLM found for a 5090 (llama.cpp fork + turboquant)

by utopman·Jun 7, 2026·2 points·0 comments

AI Analysis

●● Solid · Big Brain · Niche Gem

450k context on 32GB VRAM using turboquant KV cache compression.

Strengths
  • Specific llama.cpp fork (TheTom/llama-cpp-turboquant) with turbo3 cache quantization.
  • Concrete memory budget: 28.5GB model + 2.7GB for 450k tokens.
  • Windows-native scripts with compiled DLL management, Linux-compatible.
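The memory budget quoted above can be sanity-checked with a few lines of arithmetic. This is only a sketch built on the figures in this listing (32GB VRAM, 28.5GB model, 2.7GB cache for 450k tokens); it assumes GB means 10**9 bytes, and the function name `memory_budget` is made up for illustration.

```python
# Figures quoted in the listing; GB is assumed to mean 10**9 bytes.
VRAM_GB = 32.0
MODEL_GB = 28.5
KV_CACHE_GB = 2.7
CONTEXT_TOKENS = 450_000


def memory_budget(vram_gb, model_gb, kv_gb, tokens):
    """Return (used GB, headroom GB, KV cache bytes per token)."""
    used_gb = model_gb + kv_gb
    headroom_gb = vram_gb - used_gb
    kv_bytes_per_token = kv_gb * 1e9 / tokens
    return used_gb, headroom_gb, kv_bytes_per_token


used, headroom, per_token = memory_budget(
    VRAM_GB, MODEL_GB, KV_CACHE_GB, CONTEXT_TOKENS
)
print(
    f"used: {used:.1f} GB, headroom: {headroom:.1f} GB, "
    f"KV cache: {per_token:.0f} bytes/token"
)
```

Under these assumptions the budget leaves roughly 0.8GB of headroom and works out to about 6 KB of cache per token. Anything else the runtime allocates would have to fit in that remainder.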
Weaknesses
  • Blog post documenting a configuration, not a reusable tool or product.
  • RTX 5090 not widely available; author admits untested beyond 265k context.
Category
Target Audience

Local LLM enthusiasts, ML engineers running models on consumer hardware

Similar To

Ollama · LM Studio · llama.cpp

Post Description

Hi folks, I found this setup that seems to give great results on local consumer hardware:
- qwen 3.6 q6
- 450K context using the turboquant llama.cpp fork in turbo3 mode
- multimodal support

This AI-generated blog article is a kind of "report" on what I did, how I did it, and some example results.

I hope this can be useful to some people.

Note: I am not much interested in having success with this article; I mainly want to share what I think is an interesting use of a 5090. I generated the blog page by telling the AI to comply with the HN "rules" and remain factual.

It's definitely not perfect: it was done rather quickly and has not been properly tested beyond 265K context. Please forgive my laziness :). I am just enthusiastic right now about what can be done on a 5090.
