Vocab extractor for language learners using Stanza and frequency ranks

by crivlaldo · Mar 29, 2026 · 6 points · 0 comments

AI Analysis

●● Solid · Niche Gem · Big Brain

Classical NLP beats LLMs at knowing which words you don't know.

Strengths
  • Frequency ranks + Stanza pipeline avoids LLM hallucination on difficulty levels.
  • Released 43K collocation dataset extracted from 100M+ subtitle lines.
Weaknesses
  • UI is basic Hugging Face Spaces template with limited customization options.
  • Loses to LLMs on multi-word phrase intuition and context.
Target Audience

Language learners, ESL teachers, EdTech developers

Similar To

Readlang · LingQ · Clozemaster

Post Description

I'm building a Telegram bot to practice Dutch. GPT-4o-mini kept picking vocabulary words I already knew, so I built a classical NLP pipeline to do it instead.

It takes a short text and a learner level (A0–B1) and returns the best words to study, using Stanza for parsing and corpus frequency ranks (SUBTLEX-NL, srLex, SUBTLEX-US) for scoring. It beats the LLM at A1/A2 but loses at A0, where the LLM picks more obvious words.
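
To make the scoring idea concrete, here is a minimal sketch of picking words by frequency rank for a given level. It is not the project's actual code: the rank bands in `LEVEL_BANDS`, the `pick_words` helper, and the toy rank table are all assumptions made up for illustration. The input is a list of (lemma, UPOS) pairs, the kind of data a Stanza pipeline exposes through each word's `lemma` and `upos` fields.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical rank bands per level (assumed values, not from the project).
# A word is a study candidate when its corpus frequency rank falls inside
# the band: lower ranks are probably known, higher ranks are too rare.
LEVEL_BANDS: Dict[str, Tuple[int, int]] = {
    "A0": (1, 500),
    "A1": (300, 1500),
    "A2": (1000, 3000),
    "B1": (2500, 6000),
}

# Only content words are considered; function words are skipped.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def pick_words(
    tokens: Iterable[Tuple[str, str]],
    ranks: Dict[str, int],
    level: str,
    top_n: int = 5,
) -> List[str]:
    """Return up to top_n lemmas whose frequency rank fits the level band."""
    lo, hi = LEVEL_BANDS[level]
    seen = set()
    candidates = []
    for lemma, upos in tokens:
        lemma = lemma.lower()
        if upos not in CONTENT_POS or lemma in seen:
            continue
        seen.add(lemma)
        rank = ranks.get(lemma)
        # Words missing from the rank table or outside the band are dropped.
        if rank is None or not (lo <= rank <= hi):
            continue
        candidates.append((rank, lemma))
    # Within the band, more frequent words come first.
    candidates.sort()
    return [lemma for _, lemma in candidates[:top_n]]


# Toy example: (lemma, UPOS) pairs and made-up ranks for a Dutch sentence.
tokens = [
    ("de", "DET"),
    ("kat", "NOUN"),
    ("zitten", "VERB"),
    ("op", "ADP"),
    ("vensterbank", "NOUN"),
]
ranks = {"de": 1, "kat": 900, "zitten": 250, "vensterbank": 12000}

a1_words = pick_words(tokens, ranks, "A1")
a0_words = pick_words(tokens, ranks, "A0")
```

With these made-up ranks, the A1 band keeps only "kat", while the A0 band keeps only "zitten". A real rank table would come from a corpus list such as SUBTLEX-NL.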

I also tried adding multi-word phrases (ADJ+NOUN, VERB+NOUN, phrasal verbs) backed by NPMI-scored collocation whitelists. Couldn't beat GPT there because it just "knows" which phrases matter.
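
For readers unfamiliar with NPMI, the sketch below computes the standard normalized pointwise mutual information, NPMI(x, y) = PMI(x, y) / (-log p(x, y)), over a list of bigrams and keeps those above a cutoff. The function name `npmi_scores`, the `min_count` parameter, and the 0.5 whitelist threshold are assumptions for illustration; the post does not say how the project counts bigrams or which cutoff it uses.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple

Bigram = Tuple[str, str]


def npmi_scores(bigrams: Iterable[Bigram], min_count: int = 2) -> Dict[Bigram, float]:
    """Score each bigram with NPMI, which ranges from -1 to 1."""
    bigrams = list(bigrams)
    total = len(bigrams)
    pair_counts = Counter(bigrams)
    left_counts = Counter(a for a, _ in bigrams)
    right_counts = Counter(b for _, b in bigrams)

    scores: Dict[Bigram, float] = {}
    for (a, b), count in pair_counts.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_xy = count / total
        if p_xy == 1.0:
            continue  # -log(1) is 0, so NPMI is undefined here
        p_x = left_counts[a] / total
        p_y = right_counts[b] / total
        pmi = math.log(p_xy / (p_x * p_y))
        scores[(a, b)] = pmi / -math.log(p_xy)
    return scores


# Toy data: "give up" always occurs together, the other words mix freely.
toy_bigrams = [
    ("give", "up"), ("give", "up"),
    ("red", "car"), ("red", "wine"),
    ("big", "car"), ("big", "wine"),
]

scores = npmi_scores(toy_bigrams, min_count=1)

# Hypothetical whitelist cutoff (assumed value).
whitelist = {pair for pair, score in scores.items() if score >= 0.5}
```

A pair whose words only ever appear together scores 1, so "give up" lands on the whitelist while the freely mixing adjective and noun pairs score much lower and are filtered out.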

For the phrase work I had to extract collocations from 100M+ OpenSubtitles lines. Published them as a free dataset: https://huggingface.co/datasets/vladvlasov256/opensubs-collo... There are 43K bigrams across English, Dutch, and Serbian.

Source: https://github.com/vladvlasov256/vocab-nlp
