Best setup local LLM found for a 5090 (llama.cpp fork + turboquant)

Author: utopman

by utopman·Jun 7, 2026·2 points·0 comments


AI Analysis

●● Solid | Big Brain | Niche Gem

450k context on 32GB VRAM using turboquant KV cache compression.

Strengths

• Specific llama.cpp fork (TheTom/llama-cpp-turboquant) with turbo3 cache quantization.
• Concrete memory budget: 28.5GB model + 2.7GB for 450k tokens.
• Windows-native scripts with compiled DLL management, Linux-compatible.
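
The memory budget quoted above can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes decimal gigabytes (the listing does not say whether GB or GiB is meant) and assumes the KV cache grows linearly with the number of tokens, which the listing does not confirm for this model. The function names are illustrative.

```python
# Figures taken from the listing above.
VRAM_GB = 32.0            # RTX 5090 VRAM
MODEL_GB = 28.5           # model weights
KV_CACHE_GB = 2.7         # KV cache at full context
CONTEXT_TOKENS = 450_000  # assumption: "450k" means 450,000 tokens

GB = 1e9  # assumption: decimal gigabytes


def budget(vram_gb, model_gb, kv_gb, tokens):
    """Return (used GB, headroom GB, KV cache bytes per token)."""
    used = model_gb + kv_gb
    headroom = vram_gb - used
    bytes_per_token = kv_gb * GB / tokens
    return used, headroom, bytes_per_token


def kv_gb_for(tokens, bytes_per_token):
    """Estimate KV cache size in GB, assuming linear growth with tokens."""
    return tokens * bytes_per_token / GB


used, headroom, per_token = budget(VRAM_GB, MODEL_GB, KV_CACHE_GB, CONTEXT_TOKENS)
print(f"used: {used:.1f} GB, headroom: {headroom:.1f} GB")
print(f"KV cache per token: {per_token:.0f} bytes")
print(f"estimated KV cache at 265k tokens: {kv_gb_for(265_000, per_token):.2f} GB")
```

Under these assumptions the setup leaves well under 1 GB of headroom, so anything else resident in VRAM (desktop compositor, compute buffers) matters.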

Weaknesses

• Blog post documenting a configuration, not a reusable tool or product.
• RTX 5090 not widely available; author admits untested beyond 265k context.

Post Description

Hi folks, I found a setup that seems to give great results on local consumer hardware:
- Qwen 3.6 Q6
- 450K context using the turboquant turbo3 mode of a llama.cpp fork
- multimodal support
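
For readers who want a starting point, here is a hedged sketch of how such a launch command might be assembled. It is not the author's script. The flags `-m`, `--mmproj`, `-c`, `-ngl`, `--cache-type-k` and `--cache-type-v` exist in upstream llama.cpp's `llama-server`; that the fork accepts `turbo3` as a value for the two cache-type flags is an assumption based on the description above, and both file names are hypothetical placeholders.

```python
import shlex


def build_server_command(model_path, mmproj_path,
                         ctx_tokens=450_000, cache_type="turbo3"):
    """Build an argument list for llama-server.

    Assumption: the turboquant fork accepts `cache_type` ("turbo3")
    as a value for the upstream --cache-type-k / --cache-type-v flags.
    """
    return [
        "llama-server",
        "-m", model_path,              # GGUF model file
        "--mmproj", mmproj_path,       # multimodal projector file
        "-c", str(ctx_tokens),         # context size in tokens
        "-ngl", "99",                  # offload all layers to the GPU
        "--cache-type-k", cache_type,  # KV cache quantization for keys
        "--cache-type-v", cache_type,  # KV cache quantization for values
    ]


# Hypothetical file names, for illustration only.
cmd = build_server_command("qwen-q6.gguf", "mmproj.gguf")
print(shlex.join(cmd))
```

Check the fork's own `--help` output before relying on any of these values.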

This AI-generated blog article is a kind of "report" on what I did, how I did it, and some example results.

I hope this can be useful to some people.

Note: I am not much interested in this article being a success; I mainly want to share what I think is an interesting use of a 5090. I generated the blog page by telling the AI to comply with the HN "rules" and remain factual.

It's definitely not perfect: it was done rather quickly and not properly tested beyond 265K context. Please forgive my laziness :). I am just enthusiastic right now about what can be done on a 5090.