Real-time local TTS (31M params, 5.6x CPU, voice cloning, ONNX)
5.6x realtime on CPU with voice cloning beats most local TTS options.
Sub-sentence TTS streaming on CPU beats Piper and Sherpa-ONNX on latency by triggering synthesis at the token level.
ML engineers, roboticists, edge AI developers building privacy-first voice applications on constrained hardware.
Piper TTS · Sherpa-ONNX · Ollama (local LLM)
The goal was to see how responsive an LLM → speech system can be on ordinary laptops or edge devices.
It includes:
- Voice Activity Detection
- CPU-friendly LLM + TTS streaming
- Async pipeline to reduce latency
- Modular LLM backend
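As a rough illustration of the async streaming idea (not the repo's actual code), here is a minimal sketch where tokens flow from an LLM stage into a chunker that hands sub-sentence pieces to a TTS stage as soon as a clause boundary appears. The names `fake_llm`, `chunker`, `fake_tts`, `run_pipeline`, and the `max_words` cutoff are all hypothetical stand-ins; a real setup would swap in the actual LLM stream and TTS call.

```python
import asyncio
import re

# Assumed boundary rule: flush when a token ends in clause punctuation.
BOUNDARY = re.compile(r"[,.;:!?]$")

async def fake_llm(tokens, out_q):
    """Stand-in for a streaming LLM: emits one token at a time."""
    for tok in tokens:
        await asyncio.sleep(0)  # yield control, as a real token stream would
        await out_q.put(tok)
    await out_q.put(None)  # sentinel: stream finished

async def chunker(in_q, out_q, max_words=8):
    """Group tokens into speakable chunks without waiting for a full sentence."""
    buf = []
    while True:
        tok = await in_q.get()
        if tok is None:
            break
        buf.append(tok)
        # Flush on punctuation, or when the buffer gets long enough to speak.
        if BOUNDARY.search(tok) or len(buf) >= max_words:
            await out_q.put(" ".join(buf))
            buf = []
    if buf:
        await out_q.put(" ".join(buf))  # flush any trailing words
    await out_q.put(None)

async def fake_tts(in_q, spoken):
    """Stand-in for TTS: records each chunk it would synthesize."""
    while True:
        chunk = await in_q.get()
        if chunk is None:
            break
        spoken.append(chunk)

async def run_pipeline(tokens, max_words=8):
    tok_q, chunk_q = asyncio.Queue(), asyncio.Queue()
    spoken = []
    # All three stages run concurrently, so TTS can start on the first
    # chunk while the LLM is still producing later tokens.
    await asyncio.gather(
        fake_llm(tokens, tok_q),
        chunker(tok_q, chunk_q, max_words),
        fake_tts(chunk_q, spoken),
    )
    return spoken

if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["Sure,", "I", "can", "help", "with", "that."])))
```

The queues decouple the stages, so a slow TTS call does not block token generation. The chunk size is a trade-off: shorter chunks lower time-to-first-audio but may hurt prosody.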
Useful for local assistants, robotics prototypes, privacy-first setups, or benchmarking STT/LLM/TTS latency.
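For the benchmarking use case, one possible way to collect per-stage latency and a realtime speedup figure is sketched below. It assumes "5.6x realtime" means 5.6 seconds of audio produced per second of compute; the helper names `realtime_speedup` and `stage_timer` are hypothetical and not taken from the repo.

```python
import time
from contextlib import contextmanager

def realtime_speedup(audio_seconds, synth_seconds):
    """Seconds of audio produced per second of compute (assumed definition)."""
    if synth_seconds <= 0:
        raise ValueError("synth_seconds must be positive")
    return audio_seconds / synth_seconds

@contextmanager
def stage_timer(name, results):
    """Record the wall-clock duration of one pipeline stage in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[name] = time.perf_counter() - start

if __name__ == "__main__":
    timings = {}
    with stage_timer("tts", timings):
        time.sleep(0.01)  # placeholder for a real synthesis call
    print(timings)
    print(realtime_speedup(audio_seconds=5.6, synth_seconds=1.0))
```

Wrapping the STT, LLM, and TTS calls in separate `stage_timer` blocks gives a breakdown of where the round-trip time goes.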
We’ve been experimenting with similar CPU-first pipelines inside NEO workflows for on-device agents, and this repo is a minimal standalone version.
Would love suggestions on lightweight STT/TTS models or latency tricks people have used on CPU.
Outperformed Vapi 2× on latency by treating voice as turn-taking, not transcription.
SOTA expressivity at 14M parameters beats cloud models for on-device TTS.
Kokoro voice cloning with multilingual support, but voice cloning itself is crowded.
Nine personality modes are prompt variations wrapped in Tkinter with Groq API.
This repo bundles a complete local audio loop: the client captures audio, and the backend transcribes it with Parakeet, queries a quantized Mistral LLM via Ollama, then renders speech with Kokoro, or with Qwen3-TTS for cloning. It reports ~1s round-trip on an RTX 5070. It’s a practical, take-it-home demo for running privacy-first voice agents, though it’s still a demo: it requires specific tooling (Ollama, GPU headroom), has obvious TODOs (VAD, better warmup for cloning), and isn’t reinventing the architecture.