ChonkLM – Tiny language models running offline in the browser

by bilalba·May 9, 2026·6 points·0 comments

AI Analysis

●●● Banger · Zero to One · Wizardry · Niche Gem

Runs tiny GGUF models in the browser via custom WGSL shaders, filling the gap left by cloud APIs that do not serve models this small.

Strengths
  • Custom WGSL shader implementation bypasses ONNX quirks for better TPS on tiny models.
  • Static Cloudflare hosting means zero server costs and true client-side privacy.
  • Curated list of <500M parameter models fills a gap left by major API providers.
Weaknesses
  • Limited to tiny models; multi-turn conversation quality degrades quickly on <500M params.
  • Browser cache eviction risk means large models may need frequent re-downloading.
Target Audience

Developers experimenting with edge AI, WebGPU enthusiasts, and privacy-focused users

Similar To

svenflow/webgpu-gemma · WebLLM · Transformers.js

Post Description

I had been wanting to try language models under 500M parameters, but I couldn't find an API that serves them anywhere. So I built this static website, hosted on Cloudflare, that serves the weights, along with an inference runtime that uses WebGPU to run these models in your browser.

These are only so useful in a multi-turn conversation, but it's still interesting to see what you can pack into a <250 MB model.
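
As a back-of-the-envelope check on that size figure, weight storage is roughly parameters × bits per weight / 8. The sketch below assumes 4-bit quantization purely for illustration (the post does not say which quantization the hosted models use), and the function names are hypothetical.

```typescript
// Rough estimate only: real GGUF files also carry metadata, and some tensors
// are often stored at higher precision than the bulk of the weights.
function estimateWeightBytes(paramCount: number, bitsPerWeight: number): number {
  return Math.ceil((paramCount * bitsPerWeight) / 8);
}

function toMegabytes(bytes: number): number {
  return bytes / 1_000_000;
}

// Assumed inputs: 500M parameters at 4 bits per weight.
const approxMb = toMegabytes(estimateWeightBytes(500_000_000, 4));
console.log(approxMb); // 250
```

Under that assumption, a model just under 500M parameters lands just under 250 MB, which is consistent with the two limits mentioned above.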

I tried using ONNX versions earlier, but they had too many quirks when used with language models, and the tokens-per-second (TPS) throughput wasn't impressive. Inspired by svenflow/webgpu-gemma, I set Codex and Claude to the task of writing WGSL shaders that run inference on GGUF versions of these models.
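
For context on what a browser runtime has to do before any shader runs, here is a minimal sketch of reading the fixed-size GGUF header. It is not ChonkLM's code; `parseGgufHeader` and `readUint64AsNumber` are hypothetical names, and the layout assumed is GGUF version 2 or later (4-byte magic `GGUF`, then a little-endian uint32 version, a uint64 tensor count, and a uint64 metadata key/value count).

```typescript
interface GgufHeader {
  version: number;
  tensorCount: number;
  metadataKvCount: number;
}

// Reads a little-endian uint64 as a JS number. Exact only below 2^53,
// which is far more than any realistic tensor or metadata count.
function readUint64AsNumber(view: DataView, offset: number): number {
  const lo = view.getUint32(offset, true);
  const hi = view.getUint32(offset + 4, true);
  return hi * 4294967296 + lo;
}

// Assumes the GGUF v2+ layout, where both counts are 64-bit.
function parseGgufHeader(buffer: ArrayBuffer): GgufHeader {
  if (buffer.byteLength < 24) {
    throw new Error("buffer too small for a GGUF header");
  }
  const view = new DataView(buffer);
  const magic = String.fromCharCode(
    view.getUint8(0), view.getUint8(1), view.getUint8(2), view.getUint8(3)
  );
  if (magic !== "GGUF") {
    throw new Error("not a GGUF file");
  }
  return {
    version: view.getUint32(4, true),
    tensorCount: readUint64AsNumber(view, 8),
    metadataKvCount: readUint64AsNumber(view, 16),
  };
}
```

The metadata key/value pairs and tensor descriptors follow this header. A runtime would read those next to learn each tensor's shape, quantization type, and offset before uploading the data to GPU buffers.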

Once you have loaded the website and a model, it should work offline too, until your browser evicts the model from its cache.
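
The cache-then-network behaviour described above can be sketched as follows. The post does not say which storage mechanism the site uses, so the `ByteStore` interface, `MemoryStore`, and `loadWeights` are all hypothetical; in a browser the store could plausibly be backed by the Cache API (`caches.open`, `cache.match`, `cache.put`), and calling `navigator.storage.persist()` is one way to ask the browser not to evict the data.

```typescript
// Hypothetical storage abstraction so the logic can run outside a browser.
interface ByteStore {
  get(key: string): Promise<ArrayBuffer | undefined>;
  set(key: string, bytes: ArrayBuffer): Promise<void>;
}

type Downloader = (url: string) => Promise<ArrayBuffer>;

// In-memory stand-in for a browser cache, used here only for illustration.
class MemoryStore implements ByteStore {
  private entries = new Map<string, ArrayBuffer>();

  async get(key: string): Promise<ArrayBuffer | undefined> {
    return this.entries.get(key);
  }

  async set(key: string, bytes: ArrayBuffer): Promise<void> {
    this.entries.set(key, bytes);
  }
}

// Returns cached weights when present; otherwise downloads and stores them.
// If the browser has evicted the entry, the download path runs again.
async function loadWeights(
  url: string,
  store: ByteStore,
  download: Downloader
): Promise<ArrayBuffer> {
  const cached = await store.get(url);
  if (cached !== undefined) {
    return cached;
  }
  const bytes = await download(url);
  await store.set(url, bytes);
  return bytes;
}
```

With this shape, the second call for the same URL never touches the network, which is what makes offline use possible until the entry is evicted.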

Similar Projects

Host any GGUF model in one command

Ollama and llama.cpp server already do this with more maturity and model support.

Ship It
gauravvij137
302mo ago