Qiaohu – offline multimodal voice assistant on Snapdragon 8 Gen 2
Full voice assistant pipeline with barge-in running entirely offline on Snapdragon GPU.
Outrageous Voice Assistant - Fully local end-to-end ASR + LLM + TTS pipeline using open weight models and a simple web based UI
This repo bundles a complete local audio loop — client captures audio, backend transcribes with Parakeet, queries a quantized Mistral LLM via Ollama, then renders speech with Kokoro or Qwen3-TTS for cloning — and reports ~1s round-trip on an RTX5070. It’s a practical, take-it-home demo for running privacy-first voice agents, though it’s still a demo: requires specific tooling (Ollama, GPU headroom), has obvious TODOs (VAD, better warmup for cloning), and isn’t reinventing the architecture.
AI/ML developers, privacy-conscious hobbyists, researchers experimenting with local speech stacks
Link: https://github.com/acatovic/ova
Models used:
ASR: NVIDIA parakeet-tdt-0.6b-v3 600M LLM: Mistral ministral-3 3b 4-bit quantized TTS (Simple): Hexgrad Kokoro 82M TTS (With Voice Cloning): Qwen3-TTS
It implements a classic ASR -> LLM -> TTS architecture:
1. Frontend captures user's audio and sends a blob of bytes to the backend /chat endpoint
2. Backend parses the bytes, extracts sample rate (SR) and channels, then:
2.1. Transcribes the audio to text using an automatic speech recognition (ASR) model
2.2. Sends the transcribed text to the LLM, i.e. "the brain"
2.3. Sends the LLM response to a text-to-speech (TTS) model
2.4. Performs normalization of TTS output, converts it to bytes, and sends the bytes back to frontend
3. The frontend plays the response audio back to the user
I've had a number of people try it out with great success and you can potentially take it any direction, e.g. give it more capabilities so it can offload "hard" tasks to larger models or agents, enable voice streaming, give it skills or knowledge, etc.
Enjoy!
Full voice assistant pipeline with barge-in running entirely offline on Snapdragon GPU.
Sub-sentence TTS streaming beats Piper/Sherpa-ONNX latency by token-level triggering on CPU.
Shrinks the usual TTS bloat into a 16MB Electron-alternative wrapper while still letting you clone voices from a short sample and 'design' voices from text prompts. It handles model downloads for you, supports batch exports and macOS auto-updates — smart product trade-offs. Caveat: the app binary is tiny, but the underlying TTS models are downloaded on demand, so expect large model pulls behind the scenes.
Full-stack browser voice AI (WebLLM, Whisper.cpp, VITS) running 100% locally and client-side.
One Rust binary does what Electron apps and Python scripts couldn't for Linux dictation.
This intentionally avoids generative LLMs and instead stitches together Whisper, Piper, spaCy, VADER, sumy and YOLO into a deterministic, local assistant — a practical tradeoff that kills API bills and prompt-injection risk. The blog feature (extractive summarization + site crawling) is an especially smart move: it produces usable titles/content without hallucination. It won't replace creative LLM outputs, but for offline, private automation this is a refreshingly pragmatic build.