GitHub Repository

Outrageous Voice Assistant - Fully local end-to-end ASR + LLM + TTS pipeline using open-weight models and a simple web-based UI

144 stars · Python

Local Voice Assistant

by armcat · Feb 17, 2026 · 2 points · 0 comments

AI Analysis

●● Solid · Wizardry · Niche Gem
The Take

This repo bundles a complete local audio loop: the client captures audio, the backend transcribes it with Parakeet, queries a quantized Mistral LLM via Ollama, then renders speech with Kokoro (or with Qwen3-TTS for voice cloning). It reports a round trip of about 1 s on an RTX 5070. It is a practical, take-home demo for running privacy-first voice agents, but it is still a demo: it requires specific tooling (Ollama, GPU headroom), has obvious TODOs (VAD, better warmup for cloning), and does not reinvent the architecture.

Category
Target Audience

AI/ML developers, privacy-conscious hobbyists, researchers experimenting with local speech stacks

Post Description

Several weeks ago I built a fully local voice assistant demo with a FastAPI backend and a simple HTML frontend. All the models (ASR / LLM / TTS) are open weight and run locally, i.e. no data is sent to the Internet or to any external API. It is intended to demonstrate how easy it is to run a fully local AI setup on affordable commodity hardware, while also demonstrating the uncanny valley and teasing out the ethical considerations of such a setup, since it allows you to perform voice cloning.

Link: https://github.com/acatovic/ova

Models used:

ASR: NVIDIA parakeet-tdt-0.6b-v3 600M

LLM: Mistral ministral-3 3b 4-bit quantized

TTS (Simple): Hexgrad Kokoro 82M

TTS (With Voice Cloning): Qwen3-TTS

It implements a classic ASR -> LLM -> TTS architecture:

1. The frontend captures the user's audio and sends it as a blob of bytes to the backend /chat endpoint

2. The backend parses the bytes, extracts the sample rate (SR) and channel count, then:

2.1. Transcribes the audio to text using an automatic speech recognition (ASR) model

2.2. Sends the transcribed text to the LLM, i.e. "the brain"

2.3. Sends the LLM response to a text-to-speech (TTS) model

2.4. Normalizes the TTS output, converts it to bytes, and sends the bytes back to the frontend

3. The frontend plays the response audio back to the user
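
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the repo's actual code: the names `parse_wav`, `normalize_peak`, `encode_wav` and `handle_chat` are hypothetical, the blob is assumed to be 16-bit PCM WAV (the post does not say which container the frontend uses), and "normalization" is interpreted here as simple peak normalization. The three model stages are passed in as plain callables so that Parakeet, the Ollama-served LLM and Kokoro or Qwen3-TTS could be plugged in behind them.

```python
import io
import math
import struct
import wave
from typing import Callable, List, Tuple


def parse_wav(blob: bytes) -> Tuple[int, int, List[float]]:
    """Return (sample_rate, channels, mono samples in [-1, 1]) from 16-bit PCM WAV bytes."""
    with wave.open(io.BytesIO(blob), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        raw = wav.readframes(wav.getnframes())
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)
    # Average the interleaved channels of each frame down to one mono sample.
    mono = [
        sum(ints[i:i + channels]) / (channels * 32768.0)
        for i in range(0, len(ints), channels)
    ]
    return sample_rate, channels, mono


def normalize_peak(samples: List[float], peak: float = 0.95) -> List[float]:
    """Scale samples so the loudest one sits at `peak`; silence is returned unchanged."""
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0.0:
        return list(samples)
    gain = peak / loudest
    return [s * gain for s in samples]


def encode_wav(samples: List[float], sample_rate: int) -> bytes:
    """Pack mono float samples into 16-bit PCM WAV bytes."""
    ints = [max(-32768, min(32767, round(s * 32767))) for s in samples]
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(ints)}h", *ints))
    return buf.getvalue()


def handle_chat(
    blob: bytes,
    transcribe: Callable[[List[float], int], str],
    generate: Callable[[str], str],
    synthesize: Callable[[str], Tuple[List[float], int]],
) -> bytes:
    """Run one ASR -> LLM -> TTS turn and return the reply as WAV bytes."""
    sample_rate, _channels, audio = parse_wav(blob)      # step 2
    text = transcribe(audio, sample_rate)                # step 2.1
    reply = generate(text)                               # step 2.2
    speech, tts_rate = synthesize(reply)                 # step 2.3
    return encode_wav(normalize_peak(speech), tts_rate)  # step 2.4


if __name__ == "__main__":
    # Stand-in stages; real ASR / LLM / TTS models would replace these lambdas.
    tone = [0.2 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
    request = encode_wav(tone, 16000)
    response = handle_chat(
        request,
        transcribe=lambda audio, sr: "hello",
        generate=lambda text: text.upper(),
        synthesize=lambda reply: (
            [0.1 * math.sin(2 * math.pi * 220 * n / 24000) for n in range(2400)],
            24000,
        ),
    )
    print(len(response), "bytes of WAV audio")
```

Passing the stages in as callables keeps the request handler independent of any particular model, which also makes it easy to swap the simple TTS for the voice-cloning one without touching the parsing or normalization code.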

I've had a number of people try it out with great success, and you can potentially take it in any direction, e.g. give it more capabilities so it can offload "hard" tasks to larger models or agents, enable voice streaming, give it skills or knowledge, etc.

Enjoy!

Similar Projects

AI/ML · ●● Solid

My 16MB vibe-coded voice cloning app

Shrinks the usual TTS bloat into a 16MB Electron-alternative wrapper while still letting you clone voices from a short sample and 'design' voices from text prompts. It handles model downloads for you, supports batch exports and macOS auto-updates — smart product trade-offs. Caveat: the app binary is tiny, but the underlying TTS models are downloaded on demand, so expect large model pulls behind the scenes.

Dark Horse · Wizardry · Ship It
yoav
20 · 3mo ago
AI/ML · ●● Solid

Blog and other OpenClaw features without a language model

This intentionally avoids generative LLMs and instead stitches together Whisper, Piper, spaCy, VADER, sumy and YOLO into a deterministic, local assistant — a practical tradeoff that kills API bills and prompt-injection risk. The blog feature (extractive summarization + site crawling) is an especially smart move: it produces usable titles/content without hallucination. It won't replace creative LLM outputs, but for offline, private automation this is a refreshingly pragmatic build.

Big Brain · Niche Gem
safestclaw
10 · 3mo ago