Vocab extractor for language learners using Stanza and frequency ranks

Author: crivlaldo

by crivlaldo · Mar 29, 2026 · 6 points · 0 comments


AI Analysis

●● Solid · Niche Gem · Big Brain

Classical NLP beats LLMs at knowing which words you don't know.

Strengths

• Combining frequency ranks with a Stanza pipeline avoids LLM hallucination of difficulty levels.
• Releases a 43K-bigram collocation dataset extracted from 100M+ OpenSubtitles lines.

Weaknesses

• The UI is a basic Hugging Face Spaces template with limited customization options.
• Loses to LLMs at judging multi-word phrases and context.

Post Description

I'm building a Telegram bot to practice Dutch. GPT-4o-mini kept picking vocabulary words I already knew, so I built a classical NLP pipeline to do it instead.

It takes a short text plus a learner level (A0–B1) and returns the best words to study, using Stanza for parsing and corpus frequency ranks (SUBTLEX-NL, srLex, SUBTLEX-US) for scoring. It beats the LLM at A1/A2 but loses at A0, where the LLM picks more obvious words.
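
The selection step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual code: the per-level rank cutoffs in `KNOWN_RANK_BY_LEVEL`, the content-word filter, and the function names are all hypothetical, and the frequency ranks are passed in as a plain dict rather than loaded from SUBTLEX-NL. Only the Stanza calls (`stanza.Pipeline`, `doc.sentences`, `word.lemma`, `word.upos`) are the library's real API.

```python
# Hypothetical cutoffs: a learner at each level is assumed to already know
# the N most frequent lemmas. Illustrative numbers, not taken from the project.
KNOWN_RANK_BY_LEVEL = {"A0": 0, "A1": 500, "A2": 1500, "B1": 3000}

# Universal POS tags treated as study-worthy content words (an assumption).
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def lemmas_with_stanza(text: str, lang: str = "nl") -> list[tuple[str, str]]:
    """Return (lemma, upos) pairs from Stanza; requires its models to be available."""
    import stanza  # imported lazily so the scoring code below runs without Stanza

    nlp = stanza.Pipeline(lang=lang)
    doc = nlp(text)
    return [(w.lemma, w.upos) for s in doc.sentences for w in s.words]


def pick_words(
    tagged: list[tuple[str, str]],
    freq_rank: dict[str, int],
    level: str,
    limit: int = 5,
) -> list[str]:
    """Pick the most frequent content lemmas the learner probably does not know yet."""
    known_cutoff = KNOWN_RANK_BY_LEVEL[level]
    candidates: dict[str, int] = {}
    for lemma, upos in tagged:
        if lemma is None or upos not in CONTENT_POS:
            continue
        lemma = lemma.lower()
        rank = freq_rank.get(lemma)  # rank 1 = most frequent lemma in the corpus
        if rank is not None and rank > known_cutoff:
            candidates[lemma] = rank  # dict keys deduplicate repeated lemmas
    # Lower rank means more frequent, so the most useful unknown words come first.
    return sorted(candidates, key=candidates.get)[:limit]


# Toy input standing in for Stanza output and a real frequency list.
tagged = [("de", "DET"), ("kat", "NOUN"), ("slapen", "VERB"),
          ("vensterbank", "NOUN"), ("kat", "NOUN")]
freq_rank = {"de": 1, "kat": 800, "slapen": 400, "vensterbank": 9000}
print(pick_words(tagged, freq_rank, "A1"))
```

In a real run, `tagged` would come from `lemmas_with_stanza(text)`. Raising the level raises the cutoff, so the same text yields rarer words for more advanced learners.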

I also tried adding multi-word phrases (ADJ+NOUN, VERB+NOUN, phrasal verbs) backed by NPMI-scored collocation whitelists. Couldn't beat GPT there because it just "knows" which phrases matter.
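
For readers unfamiliar with NPMI (normalized pointwise mutual information), here is a minimal sketch of how bigram scores for a collocation whitelist might be computed. It uses the standard definition, PMI divided by `-log p(x, y)`, which gives values between -1 and 1. The function name, the `min_count` filter, and the shortcut of estimating word probabilities from bigram positions are assumptions made for illustration, and the post does not say which threshold the project uses.

```python
import math
from collections import Counter


def npmi_scores(bigrams, min_count=2):
    """Score each (left, right) pair with NPMI = PMI / -log p(left, right)."""
    bigrams = list(bigrams)
    total = len(bigrams)
    pair_counts = Counter(bigrams)
    # Simplification: word probabilities are estimated from bigram positions.
    left_counts = Counter(left for left, _ in bigrams)
    right_counts = Counter(right for _, right in bigrams)

    scores = {}
    for (left, right), n in pair_counts.items():
        if n < min_count:
            continue  # rare pairs give unreliable estimates
        p_xy = n / total
        if p_xy == 1.0:
            continue  # -log(1) is 0, so NPMI is undefined for a single-pair corpus
        p_x = left_counts[left] / total
        p_y = right_counts[right] / total
        pmi = math.log(p_xy / (p_x * p_y))
        scores[(left, right)] = pmi / -math.log(p_xy)
    return scores


# Toy counts standing in for pairs extracted from subtitle lines.
pairs = ([("make", "sense")] * 4 + [("make", "time")]
         + [("free", "time")] * 4 + [("free", "sense")])
scores = npmi_scores(pairs)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

A whitelist would then keep only pairs above some chosen NPMI threshold. A score of 1 means the two words only ever occur together, and a score near 0 means they co-occur about as often as chance predicts.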

For the phrase work I had to extract collocations from 100M+ OpenSubtitles lines. I published them as a free dataset of 43K bigrams across English, Dutch, and Serbian: https://huggingface.co/datasets/vladvlasov256/opensubs-collo...

Source: https://github.com/vladvlasov256/vocab-nlp