Running LLM on smartwatch – found llama.cpp loading model twice in RAM
Found llama.cpp loading models twice in RAM — fixed with host_ptr, 74% reduction.

450k context on 32GB VRAM using turboquant KV cache compression.
Local LLM enthusiasts, ML engineers running models on consumer hardware
Ollama · LM Studio · llama.cpp
This AI-generated blog article is a kind of "report" on what I did, how I did it, and some example results.
I hope this can be useful to some people.
Note: I am not much interested in having success with this article; I mainly want to share what I think is an interesting use of a 5090. I generated the blog page by telling the AI to comply with the HN "rules" and remain factual.
It's definitely not perfect: it was done rather quickly and not properly tested beyond 265K context. Please forgive my laziness :). I am just enthusiastic right now about what can be done on a 5090.
One YAML config for three backends when Ollama already handles llama.cpp alone.
Finally one CLI for Ollama, llama.cpp, and vLLM instead of three separate tools.
Proves speculative decoding slows down 4B models on 4-core CPUs despite marketing claims.
Article promises 2026 tech but just tells you to use standard Ollama.
Swaps the software PRNG for hardware entropy in vLLM sampling, but it is a niche use case with a steep setup cost.