GitHub Repository

Local-first CLI that turns Markdown scripts into multi-speaker podcast-style audio using Coqui XTTS v2.

33 stars · Python

Podvoice – Local-first CLI to turn Markdown into multi-speaker audio

by aman179102·Feb 21, 2026·1 point·0 comments

AI Analysis

●●● Banger · Solve My Problem · Niche Gem

Local multi-speaker TTS CLI with zero cloud dependencies beats ElevenLabs for podcast scripts.

Strengths
  • Fully local inference with Coqui XTTS v2 eliminates API costs, latency, and data privacy concerns for reproducible audio.
  • Clean Markdown syntax for speaker/emotion blocks is intuitive and sets a lower barrier than training scripts or parameter tuning.
  • Small, modular Python codebase with GPU-optional execution makes it hackable for beginners and sustainable for maintainers.
Weaknesses
  • Emotion tags are parsed but not yet interpreted by XTTS, so they are future work rather than a shipped differentiator today.
  • Initial model download and multi-speaker inference are slow; although a GPU is optional, running without one limits adoption on resource-constrained machines.
Target Audience

Content creators, developers, and podcast producers who want offline TTS workflows.

Similar To

ElevenLabs API · Google Cloud TTS · Eleven Mono/Multilingual

Post Description

Hi HN,

I built Podvoice because I wanted a simple way to turn Markdown podcast-style scripts into audio without relying on cloud TTS APIs.

It runs fully locally using Coqui XTTS v2. No API keys. No accounts. Just a CLI workflow.

You write something like:

[Host | calm] Hello and welcome.

[Guest | excited] Let’s talk about AI.

And it generates a single stitched audio file.
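
For readers curious how such a workflow might hang together, here is a minimal sketch of the two steps around synthesis: parsing `[Speaker | emotion] text` lines and stitching per-line WAV clips into one file. It is not Podvoice's actual code. The names `Segment`, `parse_script`, and `stitch_wavs` are hypothetical, the regex is only one guess at the syntax shown above, and the TTS call itself (Coqui XTTS v2) is deliberately left out.

```python
import io
import re
import wave
from dataclasses import dataclass
from typing import List, Optional

# Assumed line shape: "[Speaker | emotion] spoken text"; the "| emotion" part is optional.
LINE_RE = re.compile(r"^\[\s*([^\]|]+?)\s*(?:\|\s*([^\]]*?)\s*)?\]\s*(.+)$")


@dataclass
class Segment:
    speaker: str
    emotion: Optional[str]
    text: str


def parse_script(script: str) -> List[Segment]:
    """Turn a script into ordered segments, skipping blank lines."""
    segments = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"Unrecognized script line: {line!r}")
        speaker, emotion, text = match.groups()
        segments.append(Segment(speaker, emotion or None, text.strip()))
    return segments


def stitch_wavs(clips: List[bytes]) -> bytes:
    """Concatenate in-memory WAV clips; assumes all clips share one audio format."""
    if not clips:
        raise ValueError("No clips to stitch")
    out = io.BytesIO()
    with wave.open(out, "wb") as writer:
        for index, clip in enumerate(clips):
            with wave.open(io.BytesIO(clip), "rb") as reader:
                if index == 0:
                    # Copy channels, sample width, and frame rate from the first clip.
                    writer.setparams(reader.getparams())
                writer.writeframes(reader.readframes(reader.getnframes()))
    return out.getvalue()
```

In a full pipeline, each `Segment` would be synthesized with a voice chosen by `speaker`, and the resulting clips passed to `stitch_wavs` in script order. The `emotion` field is carried along but unused here, which matches the note above that emotion tags are parsed but not yet interpreted.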

Would love feedback on the idea, UX, or use cases I might be missing.

Similar Projects

AI/ML · ●● Solid

Podscript – Podcast/YouTube Transcription CLI

Outputs ready-to-use Markdown with speaker diarization and timestamps, accepts Apple Podcasts/YouTube/RSS links, and can run fully locally or use ElevenLabs for higher-quality diarization. Not groundbreaking — speech-to-text pipelines already exist — but the one-command UX, RSS browsing/search flags, and explicit local-mode make it genuinely useful for folks who want tidy transcripts without wiring together multiple tools.

Solve My Problem · Niche Gem
timf34
103mo ago
SaaS · ●● Solid

Transcriptum – fast video transcription with speaker labels and summary

It pairs WhisperX-grade transcription (speaker diarization and word-level timestamps) with optional multi-LLM analysis — summaries, Q&A, sentiment, topics and even fact-checking — plus YouTube import and standard export formats. Being vendor-agnostic and offering fact-checking is a smart differentiator, but the space is crowded (Descript/Otter/etc.); clearer accuracy numbers, pricing, or unique workflow hooks would make this stand out.

Solve My Problem · Slick
lpeancovschi
103mo ago