Codebook of 450k+ unique words and phrases acts as a text compressor
450k-entry codebook compresses text by 61% using phrase substitution.
GreedyPhrase Tokenizer: Maximizing Effective Context via Greedy Phrase Compression
Phrase mining compresses 1.21x better than tiktoken's o200k_base with about one-third of the vocabulary, but token optimization is a niche use case.
LLM builders optimizing context windows and inference throughput
Tiktoken · SentencePiece · BPE tokenizers
Benchmark (enwik9, 1 GB):
| Tokenizer | Vocab Size | Total Tokens | Ratio | Throughput |
|---|---|---|---|---|
| GreedyPhrase | 65,536 | 222,805,405 | 4.49x | 47 MB/s |
| Tiktoken o200k_base (GPT-4o) | 200,019 | 270,616,861 | 3.70x | 4.35 MB/s |
| Tiktoken cl100k_base (GPT-4) | 100,277 | 273,662,103 | 3.65x | 7.13 MB/s |
GreedyPhrase compresses 1.23x better than cl100k_base (GPT-4) and 1.21x better than o200k_base (GPT-4o), with a 1.5-3x smaller vocab and 6-11x higher encoding throughput.
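The ratios above can be re-derived from the token counts. This is a minimal sketch, assuming "Ratio" means input bytes per output token and that enwik9 is exactly 10^9 bytes. The dictionary keys and the `compression_ratio` helper are illustrative names, not part of the project.

```python
# Assumption: enwik9 is the first 10^9 bytes of an English Wikipedia dump.
ENWIK9_BYTES = 1_000_000_000

# Total token counts as reported in the benchmark.
token_counts = {
    "GreedyPhrase": 222_805_405,
    "o200k_base": 270_616_861,
    "cl100k_base": 273_662_103,
}

def compression_ratio(total_bytes, total_tokens):
    """Average number of input bytes covered by one token (assumed definition)."""
    return total_bytes / total_tokens

ratios = {name: compression_ratio(ENWIK9_BYTES, n) for name, n in token_counts.items()}

# How many times fewer tokens GreedyPhrase emits than each baseline.
relative = {name: ratios["GreedyPhrase"] / r for name, r in ratios.items()}

for name in token_counts:
    print(f"{name}: {ratios[name]:.2f}x bytes/token, GreedyPhrase advantage {relative[name]:.2f}x")
```

Under these assumptions the computed values round to the figures in the benchmark, which suggests the ratio column is plain bytes per token.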
How It Works:
1. Phrase Mining: Split the text into atoms (words, punctuation, whitespace). Mine bigrams and trigrams. The top phrases fill 95% of the vocab slots.
2. BPE Fallback: Train BPE on the residual byte sequences. This fills the remaining 5% of the vocab.
3. Greedy Encoding: Longest-match-first lookup in a trie. Byte fallback for unknowns (zero OOV).
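The steps above can be illustrated with a toy sketch of phrase mining and greedy trie encoding. It is not the project's implementation, and it rests on several assumptions: phrases are ranked by raw frequency, single atoms are counted alongside bigrams and trigrams, ids 0-255 are reserved for raw bytes, and the BPE stage is omitted. Every function name here is hypothetical.

```python
import re
from collections import Counter

ATOM_RE = re.compile(r"\w+|\s+|[^\w\s]")  # words, whitespace runs, single punctuation
_ID = ""  # trie key that marks "a token ends here"; never collides with a character

def atomize(text):
    """Split text into atoms; joining the atoms reproduces the text."""
    return ATOM_RE.findall(text)

def mine_phrases(atoms, max_phrases, max_n=3):
    """Count 1- to max_n-atom phrases and keep the most frequent (assumed ranking)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(atoms) - n + 1):
            counts["".join(atoms[i:i + n])] += 1
    return [phrase for phrase, _ in counts.most_common(max_phrases)]

def build_vocab(text, phrase_slots):
    """Map each mined phrase to an id above the 256 reserved byte ids."""
    phrases = mine_phrases(atomize(text), phrase_slots)
    return {phrase: 256 + i for i, phrase in enumerate(phrases)}

def build_trie(vocab):
    """Character trie over the vocab, stored as nested dicts."""
    root = {}
    for token, token_id in vocab.items():
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node[_ID] = token_id
    return root

def encode(text, trie):
    """Greedy longest-match encoding with UTF-8 byte fallback."""
    ids, i = [], 0
    while i < len(text):
        node, j, best_id, best_end = trie, i, None, i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if _ID in node:  # remember the longest complete token seen so far
                best_id, best_end = node[_ID], j
        if best_id is None:
            ids.extend(text[i].encode("utf-8"))  # fallback: one id per byte
            i += 1
        else:
            ids.append(best_id)
            i = best_end
    return ids

def decode(ids, id_to_token):
    """Invert encode: byte ids pass through, others expand to their phrase."""
    out = bytearray()
    for t in ids:
        out += bytes([t]) if t < 256 else id_to_token[t].encode("utf-8")
    return out.decode("utf-8")

corpus = "the cat sat on the mat. the cat sat on the hat."
vocab = build_vocab(corpus, phrase_slots=50)
trie = build_trie(vocab)
id_to_token = {i: tok for tok, i in vocab.items()}
ids = encode(corpus, trie)
```

In this sketch the byte fallback is what gives zero OOV: any character with no match in the trie is emitted as its UTF-8 bytes, so `decode(encode(s))` round-trips for arbitrary input. A production tokenizer would presumably fill the remaining vocab slots with BPE merges over those residual bytes rather than emitting them one by one.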