GitHub Repository
8 stars · Python

Kitten TTS Based Low-Latency Streaming Voice Assistant on CPU

by gauravvij137 · Feb 26, 2026 · 3 points · 0 comments

AI Analysis

●● Solid · Niche Gem · Wizardry

Sub-sentence TTS streaming on CPU: synthesis is triggered at the token level rather than at sentence boundaries, which beats Piper and Sherpa-ONNX on latency.

Strengths
  • Genuine latency innovation: 1.25 s time to first audio (TTFA), achieved by triggering synthesis on token counts instead of waiting for sentence boundaries
  • Thorough benchmarking and comparison table (Piper, Sherpa-ONNX) with concrete metrics
  • 32-core ONNX tuning and adaptive buffering show real optimization thought, not just 'runs on CPU'
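The difference between token-level and sentence-level triggering can be illustrated with a minimal sketch. This is not the repository's code: the function names, the `max_tokens` threshold, and the punctuation rule are all assumptions chosen for illustration. The idea is only that a chunk is handed to the TTS engine once a few tokens have accumulated, rather than once a full sentence has arrived.

```python
from typing import Iterable, Iterator

# Assumed sentence terminators; a real pipeline may use a richer rule.
SENTENCE_END = (".", "!", "?")


def chunk_by_tokens(tokens: Iterable[str], max_tokens: int = 8) -> Iterator[str]:
    """Yield a chunk once `max_tokens` tokens have accumulated,
    or earlier if the latest token ends a sentence.

    `max_tokens` is a hypothetical tuning knob, not a value from the repo.
    """
    buffer: list[str] = []
    for token in tokens:
        buffer.append(token)
        ends_sentence = token.rstrip().endswith(SENTENCE_END)
        if len(buffer) >= max_tokens or ends_sentence:
            text = "".join(buffer).strip()
            if text:
                yield text
            buffer = []
    # Flush whatever is left when the token stream ends.
    text = "".join(buffer).strip()
    if text:
        yield text


def chunk_by_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Baseline for comparison: yield only at sentence boundaries."""
    buffer: list[str] = []
    for token in tokens:
        buffer.append(token)
        if token.rstrip().endswith(SENTENCE_END):
            yield "".join(buffer).strip()
            buffer = []
    text = "".join(buffer).strip()
    if text:
        yield text
```

With a ten-token sentence and `max_tokens=4`, the token-based chunker emits its first chunk after four tokens, while the sentence-based baseline emits nothing until the final period. That earlier first chunk is what lets audio start sooner, at the cost of the TTS engine seeing less context per chunk.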
Weaknesses
  • Narrow audience: local voice assistants on CPU is a niche within edge AI, not a broad product category
  • NEO AI agent narrative feels performative; core value is the latency trick, which is already stated upfront
Target Audience

ML engineers, roboticists, edge AI developers building privacy-first voice applications on constrained hardware.

Similar To

Piper TTS · Sherpa-ONNX · Ollama (local LLM)

Post Description

We asked Neo AI to build a small voice assistant pipeline that runs with low latency on CPU instead of requiring a GPU.

The goal was to see how responsive an LLM → speech system can be on normal laptops or edge devices.

It includes:
  • Voice Activity Detection
  • CPU-friendly LLM + TTS streaming
  • Async pipeline to reduce latency
  • Modular LLM backend
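As a rough illustration of why an async pipeline reduces latency, the sketch below wires three stages together with `asyncio.Queue` so that synthesis and playback start while the LLM is still generating. Everything here is a stand-in: `fake_llm`, `fake_tts`, `playback`, the queue sizes, and the simulated delays are hypothetical and do not come from the repository. Voice Activity Detection and STT are left out to keep the example short.

```python
import asyncio


async def fake_llm(prompt: str, text_queue: asyncio.Queue) -> None:
    """Stand-in for a streaming LLM: emits text chunks, then a None sentinel."""
    for chunk in ["Hello there,", "how can I", "help you?"]:
        await asyncio.sleep(0.01)  # simulated generation delay
        await text_queue.put(chunk)
    await text_queue.put(None)  # signal end of stream


async def fake_tts(text_queue: asyncio.Queue, audio_queue: asyncio.Queue) -> None:
    """Stand-in for TTS: turns each text chunk into a fake audio buffer."""
    while True:
        chunk = await text_queue.get()
        if chunk is None:
            await audio_queue.put(None)  # propagate end of stream
            break
        await asyncio.sleep(0.01)  # simulated synthesis delay
        await audio_queue.put(f"<audio:{chunk}>")


async def playback(audio_queue: asyncio.Queue, played: list) -> None:
    """Stand-in for audio output: records buffers in the order received."""
    while True:
        audio = await audio_queue.get()
        if audio is None:
            break
        played.append(audio)


async def run_pipeline(prompt: str) -> list:
    # Bounded queues apply backpressure if a downstream stage falls behind.
    text_queue: asyncio.Queue = asyncio.Queue(maxsize=4)
    audio_queue: asyncio.Queue = asyncio.Queue(maxsize=4)
    played: list = []
    # All three stages run concurrently; none waits for another to finish.
    await asyncio.gather(
        fake_llm(prompt, text_queue),
        fake_tts(text_queue, audio_queue),
        playback(audio_queue, played),
    )
    return played
```

Calling `asyncio.run(run_pipeline("hi"))` returns the fake audio buffers in generation order. In a real pipeline the model calls would be blocking CPU work, so they would likely need to run in a thread or process pool (for example via `asyncio.to_thread`) rather than directly on the event loop.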

Useful for local assistants, robotics prototypes, privacy-first setups, or benchmarking STT/LLM/TTS latency.

We’ve been experimenting with similar CPU-first pipelines inside NEO workflows for on-device agents, and this repo is a minimal standalone version.

Would love suggestions on lightweight STT/TTS models or latency tricks people have used on CPU.

Similar Projects

AI/ML · ●● Solid

KokoClone – Zero-shot voice cloning using Kokoro TTS

Kokoro voice cloning with multilingual support, but voice cloning itself is crowded.

Niche Gem · Ship It
Ashish106
21 · 3mo ago
AI/ML · ●● Solid

Local Voice Assistant

This repo bundles a complete local audio loop: the client captures audio, the backend transcribes it with Parakeet, queries a quantized Mistral LLM via Ollama, then renders speech with Kokoro or Qwen3-TTS for cloning. It reports a ~1s round-trip on an RTX 5070. It’s a practical, take-it-home demo for running privacy-first voice agents, though it’s still a demo: it requires specific tooling (Ollama, GPU headroom), has obvious TODOs (VAD, better warmup for cloning), and isn’t reinventing the architecture.

Wizardry · Niche Gem
armcat
20 · 3mo ago