Bonsai 1.7B ternary model at 442T/s on M4 Max

by hhuytho·May 4, 2026·13 points·3 comments

AI Analysis

●●● Banger · Wizardry · Dark Horse

An autonomous agent wrote custom Metal kernels that boost decode speed by 42% over upstream llama.cpp.

Strengths
  • Custom GPU kernels for the matvec, FFN, and KV-cache layers, shape-specialized for Bonsai 1.7B Q2_0 decode.
  • Output verified as numerically identical to the reference build, confirmed by a top-1 token match.
  • Drop-in replacement includes model file, chat REPL, benchmark scripts, and OpenAI-compatible API server.
Weaknesses
  • Apple Silicon only; Intel Mac and CPU-only builds are not currently supported.
  • Q2_0 quantization introduces small accuracy delta versus F16 precision models.
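
The top-1 token match mentioned under Strengths can be illustrated with a small check. The post does not say how the verification was actually run, so this is only a minimal sketch under assumptions: the function names `argmax` and `top1_match_rate` are hypothetical, and the logits are toy values rather than real model output.

```python
from typing import Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def top1_match_rate(
    ref_logits: Sequence[Sequence[float]],
    test_logits: Sequence[Sequence[float]],
) -> float:
    """Fraction of positions where both builds pick the same top-1 token."""
    if len(ref_logits) != len(test_logits):
        raise ValueError("both runs must cover the same number of positions")
    if not ref_logits:
        raise ValueError("need at least one position to compare")
    matches = sum(
        argmax(ref) == argmax(test)
        for ref, test in zip(ref_logits, test_logits)
    )
    return matches / len(ref_logits)


# Toy logits for three decode positions over a four-token vocabulary.
reference = [[0.1, 2.3, 0.4, 0.0], [1.5, 0.2, 0.1, 0.9], [0.0, 0.1, 0.2, 3.0]]
optimized = [[0.1, 2.3, 0.4, 0.0], [1.5, 0.2, 0.1, 0.9], [0.0, 0.1, 0.2, 3.0]]

print(top1_match_rate(reference, optimized))
```

A rate of 1.0 means every position agrees on the most likely token. It is a weaker statement than bit-identical logits, which would need an element-wise comparison.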
Target Audience

Mac developers, local LLM enthusiasts, ML engineers

Similar To

llama.cpp · MLX · Ollama

Post Description

We took the recently released Bonsai 1.7B ternary model from PrismML (https://github.com/PrismML-Eng/Bonsai-demo) and ran our agentic evolution search on it for 6 hours to optimize the Metal kernels. The search was fully autonomous. Measured against unmodified upstream llama.cpp at the same Bonsai/Q2_0 commit, on the same M4 Max:
- tg128: 309.82 → 442.42 t/s (+42.0%)
- pp512: 4250.32 → 4622.63 t/s (+8.8%)
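
The relative gains can be recomputed from the throughput figures quoted in the post. The sketch below assumes the usual definition of speedup as (optimized / baseline − 1); the helper name `speedup_percent` is hypothetical. In llama.cpp's benchmark naming, tg128 is generation of 128 tokens and pp512 is processing of a 512-token prompt.

```python
def speedup_percent(baseline_tps: float, optimized_tps: float) -> float:
    """Relative throughput gain of optimized over baseline, in percent."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (optimized_tps / baseline_tps - 1.0) * 100.0


# Tokens per second as quoted in the post (same M4 Max, same commit).
results = {
    "tg128": (309.82, 442.42),   # text generation, 128 tokens
    "pp512": (4250.32, 4622.63),  # prompt processing, 512 tokens
}

for name, (base, opt) in results.items():
    gain = speedup_percent(base, opt)
    print(f"{name}: {base:.2f} -> {opt:.2f} t/s ({gain:+.1f}%)")
```

With the figures as quoted, the pp512 gain comes out to about +8.8%, matching the post. The tg128 ratio comes out closer to +42.8% than the stated +42.0%; the gap may be due to rounding or to figures taken from different runs.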
