GitHub Repository

Live speech translation powered by on-device AI and cloud providers — OpenAI, Google Gemini, Palabra.ai, Kizuna AI, Volcengine, and more

890 stars · TypeScript

Sokuji – Open-source speech translator with on-device AI WASM/WebGPU

by jiangzhuo·Mar 5, 2026·2 points·0 comments

AI Analysis

●●● Banger · Wizardry · Ship It · Solve My Problem

48 ASR models plus fully offline translation and TTS go beyond Whisper-only tools and cloud services such as Otter.ai.

Strengths
  • 136 TTS models across 53 languages (Piper, Coqui, Mimic3, Matcha) give multilingual coverage that is rare in open-source projects.
  • Browser extension captures participant audio and injects virtual microphone—genuinely useful for Teams/Meet/Discord.
  • Local inference mode means zero API costs and full privacy—genuine differentiator vs cloud-only tools.
Weaknesses
  • WebGPU support limited to Chrome/Edge; Firefox WebGPU still experimental, narrowing cross-browser reach.
  • No bundled offline language packs: models are downloaded on demand and cached, so each model needs a connection on first use before it works offline.
Target Audience

Users needing private speech translation, developers building multilingual apps, meeting participants in remote calls.

Similar To

Whisper (OpenAI) · Otter.ai · Google Live Translate

Post Description

Hi HN, I built Sokuji, an open-source live speech translation app that runs as both an Electron desktop app and a Chrome/Edge browser extension.

The latest release (v0.15) adds Local Inference mode — fully on-device ASR, translation, and TTS using WASM and WebGPU. No API key, no internet, no data leaving your machine. It ships with:

- 48 ASR models covering 99+ languages (sherpa-onnx WASM + Whisper WebGPU)
- 55+ translation language pairs (Opus-MT) plus multilingual LLMs (Qwen 2.5/3/3.5) via WebGPU
- 136 TTS models across 53 languages (Piper, Coqui, Mimic3, Matcha)
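To make the ASR, translation, and TTS chain concrete, here is a minimal sketch of how three such stages could be composed. The interfaces and the `translateSpeech` function are hypothetical names chosen for illustration; they are not Sokuji's actual API, and the real app presumably streams audio rather than handling one buffer at a time.

```typescript
// Hypothetical stage interfaces (illustrative only, not Sokuji's real API).
interface SpeechRecognizer {
  transcribe(audio: Float32Array, sampleRate: number): Promise<string>;
}

interface Translator {
  translate(text: string, from: string, to: string): Promise<string>;
}

interface SpeechSynthesizer {
  synthesize(text: string, lang: string): Promise<Float32Array>;
}

interface PipelineStages {
  asr: SpeechRecognizer;
  mt: Translator;
  tts: SpeechSynthesizer;
}

interface TranslationResult {
  transcript: string;
  translation: string;
  audio: Float32Array;
}

// Runs one audio chunk through ASR -> translation -> TTS.
// If the recognizer hears nothing, the later stages are skipped.
async function translateSpeech(
  stages: PipelineStages,
  audio: Float32Array,
  sampleRate: number,
  from: string,
  to: string
): Promise<TranslationResult> {
  const transcript = await stages.asr.transcribe(audio, sampleRate);
  if (transcript.trim() === "") {
    return { transcript: "", translation: "", audio: new Float32Array(0) };
  }
  const translation = await stages.mt.translate(transcript, from, to);
  const speech = await stages.tts.synthesize(translation, to);
  return { transcript, translation, audio: speech };
}
```

Keeping each stage behind its own interface is one way a local model and a cloud provider could be swapped without touching the rest of the chain.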

For those who prefer cloud providers, it also supports OpenAI Realtime API, Google Gemini Live, Palabra.ai, Volcengine ST, Doubao AST 2.0, and any OpenAI-compatible endpoint.

The browser extension integrates with Google Meet, Teams, Zoom, Discord, Slack, and others — it can capture participant audio and inject translated speech via a virtual microphone.

Tech stack: React + Zustand + Vite, Electron Forge, sherpa-onnx compiled to WASM, HuggingFace Transformers.js for WebGPU inference. Models are downloaded on demand and cached in IndexedDB.
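The download-on-demand caching described above could look roughly like the sketch below. It is an assumption-laden illustration, not the project's code: `BlobStore`, `MemoryStore`, `ModelCache`, and `Downloader` are hypothetical names, and an in-memory map stands in for IndexedDB so the example runs anywhere.

```typescript
// Hypothetical storage abstraction; in a browser this would wrap IndexedDB.
interface BlobStore {
  get(key: string): Promise<Uint8Array | undefined>;
  put(key: string, value: Uint8Array): Promise<void>;
}

// Hypothetical downloader; in a browser this would likely wrap fetch().
type Downloader = (url: string) => Promise<Uint8Array>;

// In-memory stand-in for a persistent store, used here for illustration.
class MemoryStore implements BlobStore {
  private items = new Map<string, Uint8Array>();

  async get(key: string): Promise<Uint8Array | undefined> {
    return this.items.get(key);
  }

  async put(key: string, value: Uint8Array): Promise<void> {
    this.items.set(key, value);
  }
}

class ModelCache {
  private store: BlobStore;
  private download: Downloader;
  // Downloads in progress, so concurrent requests share one transfer.
  private inFlight = new Map<string, Promise<Uint8Array>>();

  constructor(store: BlobStore, download: Downloader) {
    this.store = store;
    this.download = download;
  }

  // Returns cached bytes if present; otherwise downloads once and stores them.
  async load(url: string): Promise<Uint8Array> {
    const cached = await this.store.get(url);
    if (cached !== undefined) return cached;

    const pending = this.inFlight.get(url);
    if (pending !== undefined) return pending;

    const task = this.fetchAndStore(url);
    this.inFlight.set(url, task);
    const cleanup = (): void => { this.inFlight.delete(url); };
    task.then(cleanup, cleanup); // clear the entry on success or failure
    return task;
  }

  private async fetchAndStore(url: string): Promise<Uint8Array> {
    const bytes = await this.download(url);
    await this.store.put(url, bytes);
    return bytes;
  }
}
```

With this shape, the first `load` of a model needs network access and every later call is served from the store, which is the property that would let local inference keep working offline once a model has been fetched.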

I built this because existing translation tools either require expensive API keys, send your audio to the cloud, or don't support enough languages. The local inference mode makes it practical for privacy-sensitive use cases and for people without reliable internet.

AGPL-3.0 licensed. Available on Windows, macOS, Linux, Chrome Web Store, and Edge Add-ons.

GitHub: https://github.com/kizuna-ai-lab/sokuji
Official site: https://sokuji.kizuna.ai

Similar Projects