LLM inference slowdown fixed (177 experiments, +37% attention) – in 48h

Author: christinetyip

by christinetyip·Apr 15, 2026·1 point·0 comments


AI Analysis

●●● Banger · Wizardry · Solve My Problem · Big Brain

Fused int4 attention kernel on Metal keeps LLM speed constant as context grows.
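To make the summary concrete, here is a minimal plain-Python sketch of the general idea of attending over an int4-quantized KV cache. It is not the project's Metal kernel: the function names (`quantize_int4`, `dequantize_int4`, `attend`), the symmetric per-vector scale, and the nibble packing are all assumptions chosen for illustration. "Fused" is read here as dequantizing each cached vector inside the attention loop instead of first expanding the whole cache back to floats.

```python
import math

def quantize_int4(vec):
    """Symmetric int4 quantization (assumed scheme): two values per byte."""
    scale = max(abs(x) for x in vec) / 7.0 or 1.0  # fall back to 1.0 for an all-zero vector
    q = [max(-8, min(7, round(x / scale))) & 0xF for x in vec]
    if len(q) % 2:
        q.append(0)  # pad to an even count so nibbles pair up
    packed = bytes(q[i] | (q[i + 1] << 4) for i in range(0, len(q), 2))
    return packed, scale

def dequantize_int4(packed, scale, n):
    """Unpack n signed 4-bit values and rescale them to floats."""
    out = []
    for b in packed:
        for nib in (b & 0xF, b >> 4):
            out.append((nib - 16 if nib >= 8 else nib) * scale)
    return out[:n]

def attend(query, k_cache, v_cache):
    """One decode step; each cache entry is a (packed, scale) pair."""
    d = len(query)
    scores = []
    for packed, scale in k_cache:
        k = dequantize_int4(packed, scale, d)  # dequantize on the fly
        scores.append(sum(q * x for q, x in zip(query, k)) / math.sqrt(d))
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # numerically stable softmax
    total = sum(weights)
    out = [0.0] * d
    for w, (packed, scale) in zip(weights, v_cache):
        v = dequantize_int4(packed, scale, d)
        for i in range(d):
            out[i] += (w / total) * v[i]
    return out
```

In this sketch each cached value costs 4 bits plus a shared float scale. The sketch only illustrates the storage format and the in-loop dequantization; it says nothing about how the actual kernel achieves its reported speed.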

Strengths

• Addresses KV cache slowdown with a custom Metal kernel.
• Maintains constant decode speed as the conversation context grows.
• Open-source code allows integration into existing local inference stacks.

Weaknesses

• Runs only on Apple Silicon, which limits it to Mac users.
• Requires manual integration rather than a drop-in packaged tool.

Similar Projects

AI/ML · ●● Solid

STAR prompting fixes Car Wash Problem on Sonnet 4.5 (0%->85%)

They ran a variable-isolation study across five prompt layers with 20 runs per condition and shipped experiment.py and results so you can reproduce which layer actually supplies the missing implicit fact. It’s a focused, practical read for anyone designing layered system prompts, but it feels niche and would be more persuasive with cross-model baselines and clearer statistical reporting.

Big Brain · Niche Gem

midmost44

203mo ago

Developer Tools · ●● Solid