GitHub Repository

LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar

16,475 stars · Python

oMLX – Native Mac inference server that persists KV cache to SSD

by jundot·Feb 19, 2026·1 point·0 comments

AI Analysis

●●● Banger · Solve My Problem · Wizardry · Ship It

SSD-cached KV blocks dodge re-prefill tax on context shifts—Claude Code now viable locally.

Strengths
  • Paged SSD KV cache is genuinely clever: solves the specific pain of coding agents that invalidate prefixes mid-session.
  • Menubar app + OpenAI-compatible API + built-in dashboard removes friction—real product, not a research demo.
  • Continuous batching + multi-model LRU eviction + copy-on-write shows solid engineering depth beyond the core idea.
Weaknesses
  • Apple Silicon only—massive addressable market, but eliminates Windows/Linux users (most of the inference server market).
  • Text-only LLMs, no VLM/OCR yet—limits use cases vs. vLLM or Ollama's broader model support.
Target Audience

Apple Silicon Mac users running local LLMs, especially those using coding agents like Claude Code.

Similar To

Ollama · vLLM · LM Studio

Post Description

I built an open-source LLM inference server optimized for Apple Silicon. The main motivation was coding agents - tools like Claude Code send requests where the context prefix keeps shifting, invalidating KV cache. A few turns later the agent circles back, and your Mac has to re-prefill the entire context from scratch.

oMLX solves this with paged SSD caching. Every KV cache block is persisted to disk. When a previous prefix returns, its blocks are read back from the SSD instead of being recomputed, which skips the re-prefill and speeds up long coding sessions.
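The idea of persisting KV blocks and restoring them when a prefix returns can be illustrated with a minimal sketch. This is not oMLX's code: `BLOCK_SIZE`, `block_keys`, `DiskBlockStore`, and `restore_prefix` are hypothetical names, the block payload is a placeholder object rather than real key/value tensors, and chain-hashing the token blocks is an assumed way to key them.

```python
import hashlib
import pickle
from pathlib import Path

BLOCK_SIZE = 4  # tokens per block; hypothetical value chosen for illustration


def block_keys(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a key identifies the whole prefix up to it."""
    keys, parent = [], b""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        chunk = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(chunk).encode()).hexdigest()
        keys.append(digest)
        parent = digest.encode()
    return keys


class DiskBlockStore:
    """Stores one file per block under a root directory (assumed layout)."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        return self.root / f"{key}.blk"

    def put(self, key, kv_block):
        self._path(key).write_bytes(pickle.dumps(kv_block))

    def get(self, key):
        path = self._path(key)
        return pickle.loads(path.read_bytes()) if path.exists() else None


def restore_prefix(store, tokens):
    """Return the cached blocks for the longest prefix found on disk."""
    restored = []
    for key in block_keys(tokens):
        block = store.get(key)
        if block is None:
            break  # everything after the first miss would need a fresh prefill
        restored.append(block)
    return restored
```

Because each key depends on all preceding blocks, a request whose first eight tokens match an earlier one recovers two blocks from disk, and only the remaining tokens need to be prefilled.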

It also supports continuous batching for concurrent requests, multi-model serving (LLM + embedding + reranker) with LRU eviction, block-level KV cache with prefix sharing and copy-on-write, OpenAI and Anthropic compatible APIs, and tool calling.
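Prefix sharing with copy-on-write can be sketched roughly as follows. This is an illustration under assumptions, not oMLX's implementation: `BlockPool` and `fork` are hypothetical names, and the blocks hold plain Python lists instead of KV tensors. Sequences that share a prefix point at the same blocks, and a block is copied only when one of them appends to it.

```python
class BlockPool:
    """Reference-counted blocks; a hypothetical stand-in for KV cache storage."""

    def __init__(self):
        self.data = {}     # block id -> list of cached entries
        self.refs = {}     # block id -> number of sequences using the block
        self._next_id = 0

    def alloc(self, entries=()):
        bid = self._next_id
        self._next_id += 1
        self.data[bid] = list(entries)
        self.refs[bid] = 1
        return bid

    def share(self, bid):
        self.refs[bid] += 1
        return bid

    def release(self, bid):
        self.refs[bid] -= 1
        if self.refs[bid] == 0:
            del self.data[bid], self.refs[bid]

    def append(self, bid, entry):
        """Append to a block, copying it first if another sequence shares it."""
        if self.refs[bid] > 1:
            self.refs[bid] -= 1                # drop this sequence's claim on the shared block
            bid = self.alloc(self.data[bid])   # private copy for the writer
        self.data[bid].append(entry)
        return bid


def fork(pool, block_table):
    """Start a new sequence that shares every block of an existing one."""
    return [pool.share(bid) for bid in block_table]
```

A forked sequence costs no extra memory until it diverges; the first append to a shared block allocates a private copy and leaves the original untouched for the other sequence.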

Ships as a signed macOS menubar app with a web dashboard.

GitHub: https://github.com/jundot/omlx

Similar Projects

AI/ML · ●●● Banger

Orion – Native Training LLMs on the Apple Neural Engine Without CoreML

Direct ANE access bypasses CoreML to enable training—genuinely novel Apple Silicon unlock.

Wizardry · Zero to One · Big Brain
mechramc
213mo ago
AI/ML · Mid

Running OpenClaw on a managed Mac Mini 4 instance

Shows how to run OpenClaw agents on a rented Mac mini M4 and use the 38 TOPS Neural Engine for low-latency local inference while offloading heavy work to Scaleway's Generative APIs. Practical details — hourly billing, remote desktop access, and step-by-step tutorials — make it useful for PoCs, but it's essentially a cloud-provider integration rather than a new agent platform.

Niche Gem · Solve My Problem
enthusaist
204mo ago