Slopsome – a VRAM fit calculator and tok/s database for local LLMs
VRAM calculator with crowd-sourced tok/s benchmarks when model cards already exist.
SQLite-based LLM Inference Framework for Every Device
SQLite-based LLM inference hitting 210MB RSS beats OS paging with deterministic memory control.
Edge developers, embedded ML engineers, resource-constrained deployments
llama.cpp · MLC LLM · TensorFlow Lite
I built llm.sql, an LLM inference framework that reimagines the LLM execution pipeline as a series of structured SQL queries atop SQLite.
The motivation: Edge LLMs are getting better, but hardware remains a bottleneck, especially RAM (size and bandwidth).
When available memory is smaller than the model weights plus the KV cache, the OS takes page faults and swaps pages using LRU-like strategies, which degrades throughput in ways that are hard to notice and even harder to debug. But the memory access pattern during LLM inference is deterministic: we know exactly which weights are needed and when. This means even Bélády's optimal page replacement algorithm, which evicts the page whose next use lies farthest in the future and is normally impractical because it requires knowing future accesses, is applicable here.
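To make the paging argument concrete, here is a minimal simulation sketch. It is not code from llm.sql. It assumes a hypothetical workload in which 8 equally sized layers are touched in order for each of 4 tokens, with room for only 6 layers in memory, and it counts misses under LRU and under Bélády's policy.

```python
from collections import OrderedDict


def lru_misses(accesses, capacity):
    """Count misses when the least recently used page is evicted."""
    cache = OrderedDict()
    misses = 0
    for page in accesses:
        if page in cache:
            cache.move_to_end(page)  # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # drop the least recently used page
            cache[page] = True
    return misses


def belady_misses(accesses, capacity):
    """Count misses when the page reused farthest in the future is evicted."""
    # next_use[i] is the next position at which accesses[i] is touched again.
    next_use = [float("inf")] * len(accesses)
    last_seen = {}
    for i in range(len(accesses) - 1, -1, -1):
        next_use[i] = last_seen.get(accesses[i], float("inf"))
        last_seen[accesses[i]] = i

    cache = {}  # page -> position of its next use
    misses = 0
    for i, page in enumerate(accesses):
        if page not in cache:
            misses += 1
            if len(cache) >= capacity:
                victim = max(cache, key=cache.get)  # farthest next use
                del cache[victim]
        cache[page] = next_use[i]
    return misses


# Hypothetical workload: layers 0..7 visited in order, once per token.
accesses = [layer for _token in range(4) for layer in range(8)]
print("LRU misses:   ", lru_misses(accesses, capacity=6))
print("Belady misses:", belady_misses(accesses, capacity=6))
```

On this toy trace LRU misses on all 32 accesses, because a cyclic scan larger than the cache always evicts the page that is needed next, while Bélády's policy misses 14 times. Real numbers depend on layer sizes, the KV cache, and the OS's actual replacement policy.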
So instead of letting the OS manage memory, llm.sql takes over:
- Model parameters are stored in SQLite BLOB tables
- Computational logic is implemented as SQLite C extensions
- Memory management is handled explicitly, not by the OS
- Zero heavy dependencies. No PyTorch, no Transformers. Just Python, C, or C++
This gives us explicit, deterministic control over what's in memory at each step of inference.
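The idea of keeping parameters in BLOB tables and loading them explicitly can be sketched with Python's built-in sqlite3 module. This is an illustration only: the `weights` table, its columns, and the functions `build_demo_db`, `load_layer`, and `run_forward` are hypothetical names, not llm.sql's actual schema or API, and the arithmetic runs in Python here rather than in a SQLite C extension.

```python
import sqlite3
from array import array


def build_demo_db(conn, num_layers=4, width=8):
    """Store one float32 weight vector per layer as a BLOB (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE weights (layer INTEGER PRIMARY KEY, data BLOB NOT NULL)"
    )
    for layer in range(num_layers):
        values = array("f", [float(layer + 1)] * width)
        conn.execute(
            "INSERT INTO weights (layer, data) VALUES (?, ?)",
            (layer, values.tobytes()),
        )
    conn.commit()


def load_layer(conn, layer):
    """Fetch a single layer's BLOB and decode it back into float32 values."""
    row = conn.execute(
        "SELECT data FROM weights WHERE layer = ?", (layer,)
    ).fetchone()
    weights = array("f")
    weights.frombytes(row[0])
    return weights


def run_forward(conn, num_layers, x):
    """Toy forward pass: only one layer's weights are resident at a time."""
    for layer in range(num_layers):
        w = load_layer(conn, layer)
        x = [xi * wi for xi, wi in zip(x, w)]  # stand-in for the real layer math
        del w  # release this layer before loading the next one
    return x


conn = sqlite3.connect(":memory:")
build_demo_db(conn)
result = run_forward(conn, num_layers=4, x=[1.0] * 8)
print(result)
```

Because each layer is fetched by an explicit query, the program rather than the OS decides which weights are in memory at each step, which is the property the list above describes.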
Results:
Qwen2.5-0.5B-INT8 (a ~640 MB model) runs with a peak RSS of ~210 MB and a throughput of 7.40 tokens/s.
An alpha version is available on GitHub: https://github.com/xuxianghong12/llm.sql
I'm the developer and happy to answer any technical questions about the design and implementation.