ChonkLM – Tiny language models running offline in the browser

Author: bilalba

by bilalba · May 9, 2026 · 6 points · 0 comments

AI Analysis

●●● Banger · Zero to One · Wizardry · Niche Gem

Runs GGUF models in the browser via custom WGSL shaders, covering tiny models that cloud APIs do not serve.

Strengths

• Custom WGSL shader implementation sidesteps the ONNX quirks and unimpressive TPS (tokens per second) the author ran into on tiny models.
• Static Cloudflare hosting means no inference servers to pay for, and prompts never leave the client.
• A list of <500M-parameter models fills a gap left by major API providers.

Weaknesses

• Limited to tiny models; multi-turn conversation quality degrades quickly at <500M parameters.
• Offline use depends on the browser cache; once the browser evicts a model, its weights (up to roughly 250 MB) must be downloaded again.

Post Description

I had been looking to try <500M-parameter language models, but I couldn't find an API that serves them anywhere. So I built this Cloudflare-hosted static website that hosts the weights, along with an inference runtime that uses WebGPU to run these models in your browser.

These are only so useful in a multi-turn conversation, but it's still interesting to see what you can pack into a <250 MB model.
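To get a feel for what fits under 250 MB, here is a rough back-of-the-envelope sketch. The post does not say which quantization the hosted models use, so the quantization types and parameter counts below are assumptions chosen for illustration. The bits-per-weight figures come from the Q4_0 and Q8_0 block layouts in GGUF (32 weights per block plus a 2-byte scale).

```typescript
// Approximate bits per weight for a few GGUF tensor types.
// Q4_0 packs 32 weights into 18 bytes (16 bytes of 4-bit values + a 2-byte scale).
// Q8_0 packs 32 weights into 34 bytes (32 bytes of 8-bit values + a 2-byte scale).
const BITS_PER_WEIGHT = {
  F16: 16,
  Q8_0: (34 * 8) / 32, // 8.5
  Q4_0: (18 * 8) / 32, // 4.5
} as const;

type QuantType = keyof typeof BITS_PER_WEIGHT;

// Rough weight size in megabytes (1 MB = 1e6 bytes). Ignores file metadata
// and any tensors stored at a different precision than the rest.
function estimateWeightMB(paramCount: number, quant: QuantType): number {
  return (paramCount * BITS_PER_WEIGHT[quant]) / 8 / 1e6;
}

// Hypothetical parameter counts, not the models hosted on the site.
const quants: QuantType[] = ["F16", "Q8_0", "Q4_0"];
for (const params of [135e6, 360e6, 500e6]) {
  for (const quant of quants) {
    const mb = estimateWeightMB(params, quant).toFixed(0);
    console.log(`${params / 1e6}M params @ ${quant}: ~${mb} MB`);
  }
}
```

Under these assumptions, a 360M-parameter model at Q4_0 comes to about 203 MB, while 500M parameters at the same precision comes to about 281 MB, so a file under 250 MB implies fewer parameters or fewer bits per weight.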

I tried ONNX versions earlier, but there were too many quirks in using them with language models, and the TPS (tokens per second) wasn't impressive. Inspired by svenflow/webgpu-gemma, I put Codex and Claude to the task of writing WGSL to run inference on GGUF versions of these models.
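For readers unfamiliar with GGUF, the first thing a runtime like this has to do is read the file header before it can locate any tensors. The sketch below is not taken from the project; it is a minimal illustration of the fixed 24-byte header used by GGUF version 2 and later (magic bytes, version, tensor count, metadata key/value count, all little-endian). The function and interface names are made up for this example.

```typescript
interface GgufHeader {
  version: number;
  tensorCount: number;
  metadataKvCount: number;
}

// Read a little-endian uint64 as a JS number. Exact for values below 2^53,
// which is far more than any realistic tensor or metadata count.
function readUint64AsNumber(view: DataView, offset: number): number {
  const lo = view.getUint32(offset, true);
  const hi = view.getUint32(offset + 4, true);
  return hi * 4294967296 + lo;
}

// Parse the fixed-size header at the start of a GGUF file (version 2+).
function parseGgufHeader(buffer: ArrayBuffer): GgufHeader {
  if (buffer.byteLength < 24) {
    throw new Error("buffer too small for a GGUF header");
  }
  const view = new DataView(buffer);
  const magic = String.fromCharCode(
    view.getUint8(0), view.getUint8(1), view.getUint8(2), view.getUint8(3),
  );
  if (magic !== "GGUF") {
    throw new Error(`not a GGUF file (magic was "${magic}")`);
  }
  const version = view.getUint32(4, true);
  if (version < 2) {
    // Version 1 used 32-bit counts, so the offsets below would not apply.
    throw new Error(`unsupported GGUF version ${version}`);
  }
  return {
    version,
    tensorCount: readUint64AsNumber(view, 8),
    metadataKvCount: readUint64AsNumber(view, 16),
  };
}
```

The metadata key/value pairs and tensor descriptors follow the header; a full loader would walk those to find each tensor's type, shape, and offset before uploading the data to GPU buffers.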

Once you have loaded this website and a model, it should work offline too, until your browser evicts the model from the cache.
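The offline behavior described above matches a cache-first loading pattern. The post does not say how the site actually stores weights, so the following is only a sketch under that assumption. The `ResponseLike`, `CacheLike`, and `FetchLike` types are hypothetical minimal shapes, written so that a browser `Cache` object and `fetch` should satisfy them; `loadWeights` is likewise an illustrative name.

```typescript
// Minimal shapes of the browser objects this sketch depends on.
interface ResponseLike {
  ok: boolean;
  status: number;
  clone(): ResponseLike;
  arrayBuffer(): Promise<ArrayBuffer>;
}

interface CacheLike {
  match(url: string): Promise<ResponseLike | undefined>;
  put(url: string, response: ResponseLike): Promise<void>;
}

type FetchLike = (url: string) => Promise<ResponseLike>;

// Cache-first load: serve weights from the cache when present, otherwise
// download them once and store a copy for later offline use.
async function loadWeights(
  url: string,
  cache: CacheLike,
  fetcher: FetchLike,
): Promise<{ bytes: ArrayBuffer; fromCache: boolean }> {
  const cached = await cache.match(url);
  if (cached) {
    return { bytes: await cached.arrayBuffer(), fromCache: true };
  }
  const response = await fetcher(url);
  if (!response.ok) {
    throw new Error(`download failed: HTTP ${response.status}`);
  }
  // A response body can only be read once, so cache a clone.
  await cache.put(url, response.clone());
  return { bytes: await response.arrayBuffer(), fromCache: false };
}
```

In a browser, the cache argument could come from `await caches.open("model-weights")` (the cache name is made up here) with the global `fetch` as the fetcher. A page can also call `navigator.storage.persist()` to ask the browser not to evict its storage, though the browser is free to decline.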