GitHub Repository

GreedyPhrase Tokenizer: Maximizing Effective Context via Greedy Phrase Compression

9 stars · C++

GreedyPhrase – 1.21x better compression than GPT-4o tiktoken, 6x faster


by bazlightyear·Feb 18, 2026·1 point·0 comments


AI Analysis

●● Solid · Big Brain · Wizardry

Phrase mining compresses 1.21x better than tiktoken's o200k_base with about one-third the vocabulary size, though token-count optimization is a niche need.

Strengths

•Rigorous multi-dataset benchmarking (enwik9, WikiText-103, TinyStories) shows consistent gains; 2.24x compression on repetitive prose is substantial.
•Iterative compounding strategy (phrase mining → bigram/trigram merging → BPE fallback) is elegant; scales to gigabyte corpora without OOMing on n-gram counts.
•C backend parallelization (12-thread xxHash, mmap + speculative prefetch) achieves 36–47 MB/s throughput vs. tiktoken's 4–11 MB/s.

Weaknesses

•Niche use case: only helps when the context window or inference speed is the bottleneck; most LLM users won't feel the difference.
•No integration with popular LLM frameworks (Hugging Face, Ollama, LiteLLM); it ships only as a standalone CLI and Python API, so adoption friction is high.

Post Description

A greedy phrase-based tokenizer that outperforms GPT-4 and GPT-4o tokenizers on compression, with a smaller vocabulary.

Benchmark (enwik9, 1 GB):

Tokenizer                       Vocab Size   Total Tokens   Ratio   Throughput
GreedyPhrase                        65,536    222,805,405   4.49x   47 MB/s
Tiktoken o200k_base (GPT-4o)       200,019    270,616,861   3.70x   4.35 MB/s
Tiktoken cl100k_base (GPT-4)       100,277    273,662,103   3.65x   7.13 MB/s

GreedyPhrase emits 1.23x fewer tokens than cl100k_base (GPT-4) and 1.21x fewer than o200k_base (GPT-4o), with a 1.5-3x smaller vocabulary and 6-11x higher encoding throughput.
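The headline multipliers can be re-derived from the table. The sketch below assumes the Ratio column means corpus bytes per emitted token (enwik9 is exactly 10^9 bytes); the names `BENCH`, `bytes_per_token`, and `gains_over` are illustrative and not part of the project.

```python
# Figures copied from the benchmark table; enwik9 is exactly 10**9 bytes.
CORPUS_BYTES = 10**9

# name -> (vocab size, total tokens, throughput in MB/s)
BENCH = {
    "GreedyPhrase": (65_536, 222_805_405, 47.0),
    "o200k_base": (200_019, 270_616_861, 4.35),
    "cl100k_base": (100_277, 273_662_103, 7.13),
}


def bytes_per_token(name):
    """Assumed meaning of the Ratio column: corpus bytes per emitted token."""
    return CORPUS_BYTES / BENCH[name][1]


def gains_over(baseline, candidate="GreedyPhrase"):
    """Return (token reduction, vocab reduction, throughput gain) as multipliers."""
    c_vocab, c_tokens, c_speed = BENCH[candidate]
    b_vocab, b_tokens, b_speed = BENCH[baseline]
    return b_tokens / c_tokens, b_vocab / c_vocab, c_speed / b_speed


if __name__ == "__main__":
    for name in BENCH:
        print(f"{name:13s} {bytes_per_token(name):.2f} bytes/token")
    for baseline in ("cl100k_base", "o200k_base"):
        tokens, vocab, speed = gains_over(baseline)
        print(f"vs {baseline}: {tokens:.2f}x fewer tokens, "
              f"{vocab:.2f}x smaller vocab, {speed:.1f}x throughput")
```

Under that reading the computed bytes-per-token values round to the table's 4.49x, 3.70x, and 3.65x. The throughput gain comes out near 6.6x against cl100k_base and near 10.8x against o200k_base, which is where the 6-11x range comes from.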

How It Works:

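1. Phrase Mining: Split the text into atoms (words, punctuation, whitespace) and mine atom bigrams and trigrams. The most frequent phrases fill 95% of the vocabulary slots.

2. BPE Fallback: Train BPE on the residual byte sequences that the phrases do not cover. This fills the remaining 5% of the vocabulary.

3. Greedy Encoding: Encode with a longest-match-first trie lookup. Input that matches no entry falls back to raw bytes, so nothing is out-of-vocabulary (zero OOV).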
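As a rough illustration of steps 1 and 3, here is a minimal Python sketch, not the project's C implementation. The atom regex, the function names, and the `max_phrases` parameter are assumptions made for the example. The BPE fallback of step 2 is left out: the 256 single-byte entries stand in for it, which is enough to guarantee that every input can be encoded.

```python
import re
from collections import Counter

# Assumed atom rule: a run of word characters, a run of whitespace,
# or one punctuation/symbol character.
ATOM_RE = re.compile(r"\w+|\s+|[^\w\s]")


def split_atoms(text):
    return ATOM_RE.findall(text)


def mine_phrases(atoms, max_phrases):
    """Count atom bigrams and trigrams; keep the most frequent repeated ones."""
    counts = Counter()
    for n in (2, 3):
        for i in range(len(atoms) - n + 1):
            counts["".join(atoms[i:i + n])] += 1
    return [p for p, c in counts.most_common(max_phrases) if c > 1]


def build_vocab(text, max_phrases=1000):
    """Ids 0-255 are the raw bytes (the fallback); mined phrases follow."""
    base = [bytes([b]) for b in range(256)]
    phrases = mine_phrases(split_atoms(text), max_phrases)
    return base + [p.encode("utf-8") for p in phrases]


def build_trie(vocab):
    """Byte-keyed nested dicts; the key None holds the id of an entry ending here."""
    root = {}
    for token_id, entry in enumerate(vocab):
        node = root
        for byte in entry:
            node = node.setdefault(byte, {})
        node[None] = token_id
    return root


def encode(data, trie):
    """Greedy longest match: at each position emit the longest vocab entry."""
    ids, i = [], 0
    while i < len(data):
        node, j = trie, i
        best_id, best_end = None, i
        while j < len(data) and data[j] in node:
            node = node[data[j]]
            j += 1
            if None in node:
                best_id, best_end = node[None], j
        # Every single byte is in the vocab, so best_id is always set here.
        ids.append(best_id)
        i = best_end
    return ids


def decode(ids, vocab):
    return b"".join(vocab[i] for i in ids)


if __name__ == "__main__":
    corpus = "the cat sat on the mat. the cat sat on the hat."
    vocab = build_vocab(corpus)
    trie = build_trie(vocab)
    ids = encode(corpus.encode("utf-8"), trie)
    print(len(corpus.encode("utf-8")), "bytes ->", len(ids), "tokens")
    assert decode(ids, vocab).decode("utf-8") == corpus
```

Seeding the vocabulary with all 256 byte values is one simple way to get the zero-OOV property: the trie walk always matches at least one byte, so the encoder never stalls on unseen input.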

Similar Projects

Developer Tools · ●●● Banger

Quicktok, an exact BPE tokenizer 7x faster than tiktoken

Zero-dependency C++20 tokenizer hitting 92 MB/s while matching tiktoken output byte-for-byte.

Wizardry · Big Brain

dmatth1

3116d ago

AI/ML · ● Mid

Taliesin – bit-exact KV-cache restore, 21x faster, cross-GPU verified

21x KV-cache restore speedup sounds huge, but the Medium link returns a 500 error.

Bold Bet

Wetime

1071mo ago

Developer Tools · ●●● Banger

Codebook of 450k+ unique words and phrases acts as a text compressor

450k-entry codebook compresses text by 61% using phrase substitution.

Big Brain · Wizardry

smalltorch

1112mo ago

AI/ML · ●●● Banger

Wordchipper – Rust BPE tokenizer, 9x faster than tiktoken

Nine times faster than tiktoken-rs with swappable lexer backends for benchmarking.

Wizardry · Big Brain

antimora

203mo ago

Data · ●●● Banger

chwire – Native-Format ClickHouse JavaScript Client over HTTP/TCP

Native binary format beats JSONEachRow by 2–8x on large payloads with ZSTD compression.

Big Brain · Wizardry

maxjustus

209d ago

AI/ML · ●●● Banger

Andrej Karpathy's microgpt.py to C99 microgpt.c – 4,600x faster

Pure C99 GPT with SIMD beats Python 4,600x; drop two files into any project.

Wizardry · Zero to One

Ajay__soni

4035mo ago