GitHub Repository

Outrageous Voice Assistant - Fully local end-to-end ASR + LLM + TTS pipeline using open-weight models and a simple web-based UI

160 stars · Python

Local Voice Assistant

Author: armcat

by armcat · Feb 17, 2026 · 2 points · 0 comments


AI Analysis

●● Solid · Wizardry · Cozy

Qwen3-TTS voice cloning without finetuning in ~1 second on RTX 5070.

Strengths

• Complete ASR→LLM→TTS pipeline runs locally, with no data sent to external APIs.
• Voice cloning via Qwen3-TTS requires only a 3-5 second WAV clip; no finetuning is needed.
• Open source, with specific model versions listed and a ~1 second round-trip time.

Weaknesses

• Demo project, not a product: it lacks Voice Activity Detection and task orchestration.
• Local voice assistants already exist: Rhasspy, Home Assistant Voice, and Voiceflow already serve this space.

Post Description

Several weeks ago I built a fully local voice assistant demo with a FastAPI backend and a simple HTML front-end. All the models (ASR / LLM / TTS) are open weight and run locally, i.e. no data is sent to the Internet or to any external API. It's intended to demonstrate how easy it is to run a fully local AI setup on affordable commodity hardware, while also demonstrating the uncanny valley and teasing out the ethical considerations of such a setup, since it allows you to perform voice cloning.

Link: https://github.com/acatovic/ova

Models used:

ASR: NVIDIA parakeet-tdt-0.6b-v3 600M

LLM: Mistral ministral-3 3b 4-bit quantized

TTS (Simple): Hexgrad Kokoro 82M

TTS (With Voice Cloning): Qwen3-TTS

It implements a classic ASR -> LLM -> TTS architecture:

1. Frontend captures user's audio and sends a blob of bytes to the backend /chat endpoint

2. Backend parses the bytes, extracts sample rate (SR) and channels, then:

2.1. Transcribes the audio to text using an automatic speech recognition (ASR) model

2.2. Sends the transcribed text to the LLM, i.e. "the brain"

2.3. Sends the LLM response to a text-to-speech (TTS) model

2.4. Performs normalization of TTS output, converts it to bytes, and sends the bytes back to frontend

3. The frontend plays the response audio back to the user
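
To make the backend steps concrete, here is a minimal sketch of steps 2 through 2.4 using only the Python standard library. It is not the project's actual code: it assumes the blob is a 16-bit PCM WAV file, that "normalization" means simple peak normalization, and that the three models are plain callables. The names parse_wav, normalize, to_wav_bytes, and handle_chat are hypothetical.

```python
import io
import struct
import wave
from typing import Callable, List, Tuple


def parse_wav(blob: bytes) -> Tuple[List[float], int, int]:
    """Step 2: decode 16-bit PCM WAV bytes (an assumption) into floats in
    [-1, 1], and return them with the sample rate and channel count."""
    with wave.open(io.BytesIO(blob), "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        sample_rate = wf.getframerate()
        channels = wf.getnchannels()
        raw = wf.readframes(wf.getnframes())
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)
    return [s / 32768.0 for s in ints], sample_rate, channels


def normalize(samples: List[float], peak: float = 0.95) -> List[float]:
    """Step 2.4 (assumed form): scale so the loudest sample has magnitude `peak`."""
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0.0:
        return list(samples)  # silence or empty input: nothing to scale
    scale = peak / loudest
    return [s * scale for s in samples]


def to_wav_bytes(samples: List[float], sample_rate: int) -> bytes:
    """Step 2.4: encode mono float samples as 16-bit PCM WAV bytes."""
    ints = [max(-32768, min(32767, int(round(s * 32767)))) for s in samples]
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(struct.pack(f"<{len(ints)}h", *ints))
    return buf.getvalue()


def handle_chat(
    blob: bytes,
    asr: Callable[[List[float], int, int], str],
    llm: Callable[[str], str],
    tts: Callable[[str], Tuple[List[float], int]],
) -> bytes:
    """Steps 2 to 2.4 chained together; the model callables are placeholders."""
    samples, sample_rate, channels = parse_wav(blob)
    text = asr(samples, sample_rate, channels)  # 2.1 speech to text
    reply = llm(text)                           # 2.2 "the brain"
    speech, tts_rate = tts(reply)               # 2.3 text to speech
    return to_wav_bytes(normalize(speech), tts_rate)  # 2.4 normalize and encode
```

Passing the models in as callables keeps the sketch independent of any particular ASR, LLM, or TTS library. In a FastAPI app, a function like handle_chat would presumably sit behind the /chat route, with the real models loaded once at startup.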

I've had a number of people try it out with great success, and you can potentially take it in any direction, e.g. give it more capabilities so it can offload "hard" tasks to larger models or agents, enable voice streaming, give it skills or knowledge, etc.
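
As one illustration of the offloading idea, the sketch below wraps two LLM callables behind a single function that escalates "hard" prompts to the larger one. Everything here is hypothetical and not part of the project: make_router, looks_hard, the keyword list, and the word-count threshold are invented for the example, and a real setup would likely need a better difficulty signal.

```python
from typing import Callable

LLM = Callable[[str], str]


def looks_hard(prompt: str, max_words: int = 40) -> bool:
    """Hypothetical heuristic: long prompts or certain phrases count as 'hard'."""
    keywords = ("prove", "write code", "step by step")
    lowered = prompt.lower()
    return len(lowered.split()) > max_words or any(k in lowered for k in keywords)


def make_router(small_llm: LLM, large_llm: LLM,
                is_hard: Callable[[str], bool] = looks_hard) -> LLM:
    """Return a callable that sends hard prompts to large_llm, the rest to small_llm."""
    def route(prompt: str) -> str:
        if is_hard(prompt):
            return large_llm(prompt)
        return small_llm(prompt)
    return route


# Usage with stand-in models:
router = make_router(lambda p: "small: " + p, lambda p: "large: " + p)
print(router("What time is it?"))
print(router("Please write code for quicksort"))
```

Because the router has the same signature as a single LLM callable, it can be dropped in wherever the pipeline expects one model, without touching the ASR or TTS stages.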

Enjoy!