17MB pronunciation scorer beats human experts at phoneme level

by fabiosuizu · Feb 20, 2026 · 4 points · 2 comments

AI Analysis

●●● Banger · Wizardry · Big Brain · Solve My Problem

Beats human experts at phoneme scoring while 70x smaller than SOTA models.

Strengths
  • Quantized Citrinet-256 backbone reaches a phone-level PCC of 0.580, above human inter-annotator agreement (0.555), in a 17MB model: genuine constraint engineering.
  • Sub-300ms CPU latency plus availability as a REST API, an MCP server, and an Azure Marketplace offering make it production-viable for real-time language learning feedback loops.
  • The combination of CTC forced alignment, GOP scoring, and ensemble heads is technically non-obvious and is benchmarked on speechocean762, a standard academic dataset.
Weaknesses
  • 10-15% raw accuracy gap vs SOTA suggests limited upside for high-precision use cases beyond language learning apps.
  • Demo UI is functional but barebones—no clear onboarding for developers to understand when/why to use this vs larger models.
Target Audience

Language learning app developers, speech-enabled education platforms, ESL instructors

Similar To

Google Cloud Speech-to-Text (pronunciation confidence scoring) · Speechocean762 reference systems (wav2vec2-based) · Azure Speech Service (pronunciation assessment)

Post Description

I built an English pronunciation assessment engine that fits in 17MB and runs in under 300ms on CPU.

Architecture: CTC forced alignment + GOP scoring + ensemble heads (MLP + XGBoost). No wav2vec2 or large self-supervised models — the entire pipeline uses a quantized NeMo Citrinet-256 as the acoustic backbone.
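
As a rough illustration of the GOP (Goodness of Pronunciation) step, here is a minimal sketch that assumes the acoustic model has already produced frame-level log-posteriors and that forced alignment has already assigned a frame span to each expected phone. The function name `gop_scores`, the input layout, and the toy numbers are assumptions made for this example, not the engine's actual code. It uses one common GOP variant: the log-posterior of the expected phone minus the best log-posterior in the frame, averaged over the aligned frames.

```python
import math


def gop_scores(log_posteriors, alignment):
    """Compute one GOP score per aligned phone.

    log_posteriors: list of frames; each frame is a list of log-probabilities,
        one per phone in the (hypothetical) phone inventory.
    alignment: list of (phone_id, start_frame, end_frame) tuples from forced
        alignment, with end_frame exclusive.

    A score of 0.0 means the expected phone was the most likely phone in every
    aligned frame; more negative scores mean a competing phone fit better.
    """
    scores = []
    for phone_id, start, end in alignment:
        frames = log_posteriors[start:end]
        if not frames:
            raise ValueError(f"phone {phone_id} has no aligned frames")
        # Average log-posterior of the expected (canonical) phone.
        target = sum(frame[phone_id] for frame in frames) / len(frames)
        # Average log-posterior of the best-scoring phone in each frame.
        best = sum(max(frame) for frame in frames) / len(frames)
        scores.append(target - best)
    return scores


# Toy input: 4 frames over a 3-phone inventory (made-up probabilities).
probs = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
]
log_post = [[math.log(p) for p in frame] for frame in probs]

# Expected phone 0 spans frames 0-1, expected phone 1 spans frames 2-3.
example_alignment = [(0, 0, 2), (1, 2, 4)]
example_scores = gop_scores(log_post, example_alignment)
print(example_scores)
```

In a pipeline like the one described, per-phone scores of this kind would typically be fed as features to the scoring heads rather than reported directly.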

Benchmarked on speechocean762 (standard academic benchmark, 2500 utterances):
- Phone accuracy (PCC): 0.580, which exceeds human inter-annotator agreement (0.555)
- Sentence accuracy: 0.710, which exceeds human agreement (0.675)
- Model is 70x smaller than wav2vec2-based SOTA
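
For readers unfamiliar with the metric, PCC is the Pearson correlation coefficient between the model's scores and the human reference scores. The sketch below shows how such a number could be computed; the helper name `pearson` and the score values are made up for illustration and are not taken from the benchmark.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-phone scores: model predictions vs. human reference labels.
model_scores = [1.8, 0.4, 1.1, 2.0, 0.9, 1.6]
human_scores = [2.0, 0.0, 1.0, 2.0, 1.0, 1.0]
print(round(pearson(model_scores, human_scores), 3))
```

Human inter-annotator agreement can be measured with the same function by correlating one annotator's scores against another's, which is what makes the two numbers comparable.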

Trade-off: we're ~10-15% below SOTA on raw accuracy. But for real-time feedback in language learning apps, the latency/size trade-off is worth it.

Available as REST API, MCP server (for AI agents), and on Azure Marketplace.

Demo: https://huggingface.co/spaces/fabiosuizu/pronunciation-asses...

Interested in feedback on the scoring approach and use cases people would find valuable.

Similar Projects

AI/ML · Mid

Darius – An AI router that selects the best model for each prompt

The product puts model selection behind a friendly chat UI — I can see model tags like XAI:GROK-4-1-FAST-REASONING in the screenshots — and leans hard on privacy and 'no dark patterns' messaging. The UX is clean and approachable, but the routing logic is opaque and this sits in a crowded space of multi-LLM frontends (Poe, Perplexity, etc.), so the value depends on how smart and cost-effective their orchestration actually is.

Slick · Solve My Problem
mazenkurdi
304mo ago