Makes local LLMs faster and more reliable by optimizing for your device

Author: tanavc

by tanavc·Jun 30, 2026·5 points·0 comments

AI Analysis

●●● Banger · Big Brain · Solve My Problem

Dynamic KV cache sizing beats Ollama's wasteful 4096-token default allocation.

Strengths

•Right-sizes KV cache per request instead of Ollama's fixed 4096-token over-allocation
•Four-tier RAM pressure system proactively downgrades precision before swap risk
•Transparent proxy requires zero code changes to existing Ollama integrations
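The first strength, per-request KV cache sizing, can be illustrated with a minimal sketch. This is not the project's code: the helper names, the 4-characters-per-token heuristic, and the power-of-two bucketing are all assumptions made for illustration. The only real interface used is Ollama's `options.num_ctx` and `options.num_predict` request fields.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate assuming ~4 characters per token (a heuristic, not a tokenizer)."""
    return math.ceil(len(text) / 4)


def right_size_num_ctx(prompt: str, max_new_tokens: int,
                       floor: int = 512, ceiling: int = 4096) -> int:
    """Pick the smallest power-of-two context window that fits prompt + output.

    `floor` and `ceiling` are hypothetical bounds; the ceiling mirrors the
    4096-token default mentioned above.
    """
    needed = estimate_tokens(prompt) + max_new_tokens
    size = floor
    while size < needed and size < ceiling:
        size *= 2
    return min(size, ceiling)


def build_request(model: str, prompt: str, max_new_tokens: int = 256) -> dict:
    """Build an Ollama /api/generate payload with a right-sized context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_ctx": right_size_num_ctx(prompt, max_new_tokens),
            "num_predict": max_new_tokens,
        },
    }
```

A short prompt ends up with a 512-token window under these assumptions, while a long one is capped at the ceiling. A proxy could apply the same rewrite to requests in flight, which is how existing integrations would stay unchanged.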

Weaknesses

•Ollama-only, doesn't help LM Studio or llama.cpp users
•Another proxy layer in the stack means another potential failure point

Post Description

Time to first token is 39% faster. Agent wall times decrease by 46%. No swaps.

Tracks your resource usage in real time and adjusts how the model runs so that it fits your device.

Implements KV cache sizing, prefix caching, live RAM pressure management, context trimming, KV quantization, and more.
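Two of the listed features, live RAM pressure management and context trimming, could be combined roughly as follows. This is a hedged sketch, not the project's implementation: the tier names, the usage thresholds, the context caps, and the token-cost heuristic are all hypothetical. The cache type strings (`f16`, `q8_0`, `q4_0`) are the values Ollama accepts for its KV cache type setting; how the project actually switches between them is not described in the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    tier: str
    kv_cache_type: str
    max_ctx: int


# Hypothetical thresholds on the fraction of RAM in use (upper bound, policy).
TIERS = [
    (0.70, Policy("normal", "f16", 4096)),
    (0.80, Policy("elevated", "q8_0", 4096)),
    (0.90, Policy("high", "q4_0", 2048)),
]
CRITICAL = Policy("critical", "q4_0", 1024)


def pick_policy(used_bytes: int, total_bytes: int) -> Policy:
    """Map current memory usage to one of four tiers."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used = used_bytes / total_bytes
    for limit, policy in TIERS:
        if used < limit:
            return policy
    return CRITICAL


def message_cost(message: dict) -> int:
    """Rough token cost assuming ~4 characters per token, plus one for overhead."""
    return len(message["content"]) // 4 + 1


def trim_messages(messages: list, max_tokens: int) -> list:
    """Keep system messages plus the most recent turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(message_cost(m) for m in system)
    kept = []
    for message in reversed(rest):
        cost = message_cost(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return system + kept[::-1]
```

In this sketch the memory readings are passed in as plain numbers so the policy stays easy to test; a real monitor would sample them periodically and apply the chosen policy's `max_ctx` through `trim_messages` before forwarding a chat request.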

Built a ton of features

Similar Projects

Developer Tools · ● Mid

LLM Gateway – Simple API format converter for LLM providers

LiteLLM already does this with more providers, more features, and way more maturity.

Ship It

modinfo

203mo ago

Developer Tools · ●● Solid

glide – LLM cascade proxy, auto-switches models before timeout

TTFT-aware model fallback—avoids timeouts by hedging between Opus, Sonnet, Haiku automatically.

Solve My Problem · Niche Gem

phanisaimuni116

113mo ago

AI/ML · ●●● Banger

Rapid-MLX – Run local LLMs on Mac, 2-3x faster than alternatives

Claims 4.2x Ollama speed with 0.08s cached TTFT on Apple Silicon.

Wizardry · Solve My Problem

raullen

942mo ago

Security · ● Mid

See what your employees are prompting LLMs (without network proxies)

Another AI security wrapper in a crowded market, but agent-side integration is interesting.

Bold Bet

asilozyildirim

402mo ago

AI/ML · ●●● Banger

Genosis – LLM cost optimization that learns from your traffic

Not a proxy — optimizes Anthropic and OpenAI caching without adding latency or seeing your data.

Big Brain · Solve My Problem · Dark Horse

samherder

203mo ago

Finance · ●● Solid

We Made Nasdaq Parsing Even Faster (and More Reliable)

Instead of chunking at arbitrary byte offsets, they scan once to build message boundaries, then binary-search for clean split points; that simple change eliminates the OOM-by-design scenario. Coupled with SIMD-aware prefetch tuning (different distances for AVX2 vs AVX-512), this is practical microarch-aware engineering, not just benchmark stunts. I want this shipped as a library or tool so other firms can stop reinventing the same footguns.

Wizardry · Niche Gem

sundancegh

204mo ago