GreedyPhrase Tokenizer: Maximizing Effective Context via Greedy Phrase Compression

9 stars · C++

GreedyPhrase – 1.21x better compression than GPT-4o tiktoken, 6x faster

by bazlightyear·Feb 18, 2026·1 point·0 comments

AI Analysis

●● Solid · Big Brain · Wizardry

Phrase mining beats tiktoken's o200k_base compression by 1.21x with about a third of the vocabulary, but token optimization is a niche need.

Strengths
  • Rigorous multi-dataset benchmarking (enwik9, WikiText-103, TinyStories) shows consistent gains; 2.24x compression on repetitive prose is substantial.
  • Iterative compounding strategy (phrase mining → bigram/trigram merging → BPE fallback) is elegant; scales to gigabyte corpora without OOMing on n-gram counts.
  • C backend parallelization (12-thread xxHash, mmap + speculative prefetch) achieves 36–47 MB/s throughput vs. tiktoken's 4–11 MB/s.
Weaknesses
  • Niche use case: it only helps when the context window or inference speed is the bottleneck; most LLM users won't feel the difference.
  • No integration with popular LLM frameworks (Hugging Face, Ollama, LiteLLM); it ships as a standalone CLI/Python API only, so adoption friction is high.
Target Audience

LLM builders optimizing context windows and inference throughput

Similar To

Tiktoken · SentencePiece · BPE tokenizers

Post Description

A greedy phrase-based tokenizer that outperforms GPT-4 and GPT-4o tokenizers on compression, with a smaller vocabulary.

Benchmark (enwik9, 1 GB):

Tokenizer                       Vocab Size   Total Tokens   Ratio   Throughput
GreedyPhrase                        65,536    222,805,405   4.49x   47 MB/s
Tiktoken o200k_base (GPT-4o)       200,019    270,616,861   3.70x   4.35 MB/s
Tiktoken cl100k_base (GPT-4)       100,277    273,662,103   3.65x   7.13 MB/s

GreedyPhrase compresses 1.23x better than the GPT-4 tokenizer (cl100k_base) and 1.21x better than the GPT-4o tokenizer (o200k_base), with a 1.5-3x smaller vocabulary and 6-11x higher encoding throughput.
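
The headline multipliers can be re-derived from the benchmark figures above. The short sketch below does that arithmetic. It assumes the "Ratio" column means input bytes per token, which matches the published values when enwik9 is taken as exactly 10^9 bytes. The dictionary layout and function names are illustrative only and are not part of the project.

```python
# Figures copied from the enwik9 benchmark above.
CORPUS_BYTES = 1_000_000_000  # enwik9 is the first 10^9 bytes of an English Wikipedia dump

BENCH = {
    "GreedyPhrase": {"vocab": 65_536, "tokens": 222_805_405, "mb_per_s": 47.0},
    "o200k_base": {"vocab": 200_019, "tokens": 270_616_861, "mb_per_s": 4.35},
    "cl100k_base": {"vocab": 100_277, "tokens": 273_662_103, "mb_per_s": 7.13},
}


def bytes_per_token(name):
    """Compression ratio, assumed to mean input bytes per emitted token."""
    return CORPUS_BYTES / BENCH[name]["tokens"]


def relative_gain(baseline, candidate="GreedyPhrase"):
    """How many times fewer tokens the candidate emits than the baseline."""
    return BENCH[baseline]["tokens"] / BENCH[candidate]["tokens"]


def speedup(baseline, candidate="GreedyPhrase"):
    """Encoding throughput of the candidate relative to the baseline."""
    return BENCH[candidate]["mb_per_s"] / BENCH[baseline]["mb_per_s"]


def vocab_shrink(baseline, candidate="GreedyPhrase"):
    """How many times smaller the candidate vocabulary is."""
    return BENCH[baseline]["vocab"] / BENCH[candidate]["vocab"]


if __name__ == "__main__":
    for base in ("o200k_base", "cl100k_base"):
        print(
            f"vs {base}: {relative_gain(base):.2f}x fewer tokens, "
            f"{speedup(base):.1f}x throughput, {vocab_shrink(base):.1f}x smaller vocab"
        )
```

Under these assumptions the 1.21x and 1.23x figures fall out of the token counts directly. The "6x faster" in the title corresponds to the comparison against cl100k_base, while the comparison against o200k_base is closer to 11x.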

How It Works:

1. Phrase Mining — Split into atoms (words, punctuation, whitespace). Mine bigrams/trigrams. Top phrases fill 95% vocab slots.

2. BPE Fallback — Train BPE on residual byte sequences. Fills remaining 5% vocab.

3. Greedy Encoding — Longest-match-first Trie. Byte fallback for unknowns (zero OOV).
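
Steps 1 and 3 above can be illustrated with a toy Python sketch. This is not the project's C++ implementation. The atom regex, the ranking of phrases by raw frequency, the id layout (0-255 for raw bytes, phrases from 256 upward), and every function name are assumptions made for illustration. The BPE fallback of step 2 is omitted, so anything outside the phrase vocabulary is emitted as single bytes.

```python
import re
from collections import Counter

# Assumed atom pattern: a run of word characters, a run of whitespace,
# or a single punctuation character.
ATOM_RE = re.compile(r"\w+|\s+|[^\w\s]")
FIRST_PHRASE_ID = 256  # ids 0..255 are reserved for raw bytes (byte fallback)


def mine_phrases(text, max_phrases):
    """Return the most frequent atom bigrams/trigrams that occur more than once."""
    atoms = ATOM_RE.findall(text)
    counts = Counter()
    for n in (2, 3):
        for i in range(len(atoms) - n + 1):
            counts["".join(atoms[i:i + n])] += 1
    return [p for p, c in counts.most_common(max_phrases) if c > 1]


def build_trie(phrases):
    """Byte-level trie; the key None on a node stores the phrase's token id."""
    root = {}
    for offset, phrase in enumerate(phrases):
        node = root
        for b in phrase.encode("utf-8"):
            node = node.setdefault(b, {})
        node[None] = FIRST_PHRASE_ID + offset
    return root


def encode(data, trie):
    """Greedy longest-match encoding of a bytes object into token ids."""
    ids, i = [], 0
    while i < len(data):
        node, best_id, best_end, j = trie, None, i, i
        # Walk the trie as far as the input allows, remembering the longest
        # complete phrase seen along the way.
        while j < len(data) and data[j] in node:
            node = node[data[j]]
            j += 1
            if None in node:
                best_id, best_end = node[None], j
        if best_id is None:
            ids.append(data[i])  # byte fallback: the byte value is its own id
            i += 1
        else:
            ids.append(best_id)
            i = best_end
    return ids


def decode(ids, phrases):
    """Invert encode(): expand phrase ids and copy raw bytes through."""
    out = bytearray()
    for t in ids:
        if t < FIRST_PHRASE_ID:
            out += bytes([t])
        else:
            out += phrases[t - FIRST_PHRASE_ID].encode("utf-8")
    return out.decode("utf-8")


if __name__ == "__main__":
    corpus = "the cat sat on the mat. " * 20
    phrases = mine_phrases(corpus, max_phrases=50)
    ids = encode(corpus.encode("utf-8"), build_trie(phrases))
    assert decode(ids, phrases) == corpus
    print(len(corpus.encode("utf-8")), "bytes ->", len(ids), "tokens")
```

Because every byte value has its own id, any input can be encoded, which is how the byte fallback gives zero out-of-vocabulary tokens. The encoder commits to the longest phrase at each position without looking ahead, which is the "greedy" part of the design.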
