GitHub Repository

Multi-model AI orchestration platform — route coding tasks through specialized LLM agents with consensus, adversarial review, and an independent Arbiter layer. Model-agnostic. Plugin any LLM. Ship better code.

6 stars · Python

CRTX – AI code gen that tests and fixes its own output (OSS)

by johnnycash926·Feb 21, 2026·2 points·1 comment

AI Analysis

●●● Banger · Big Brain · Ship It · Zero to One

Ditched multi-model bloat, proved single model + local test loop beats expensive debate.

Strengths
  • Ruthless benchmarking: 94% quality at $0.36 beats $5.59 multi-model pipeline on cost and output consistency
  • The Loop (generate→test→fix→review) mirrors real developer workflow instead of chaining prompts
  • Independent Arbiter enforces consensus without compounding errors; repeatable escalation tiers on stall
Weaknesses
  • Limited to pytest-compatible Python; no TypeScript, Go, or other ecosystems in initial release
  • Benchmarks on only 12 prompts across three task types; scaling to complex enterprise codebases unproven
Category
Target Audience

Backend developers, Python developers, anyone using AI code generation in workflows

Similar To

GitHub Copilot · Continue.dev · Cursor

Post Description

We built an open-source CLI that generates code, runs tests, fixes failures, and gets an independent AI review, all before you see the output.

We started with a multi-model pipeline where different AI models handled different stages (architect, implement, refactor, verify). We assumed more models meant better code. Then we benchmarked it: 39% average quality score at $4.85 per run. A single model scored 94% at $0.36. Our pipeline was actively making things worse.

So we killed it and rebuilt around what developers actually do when they get AI-generated code: run it, test it, fix what breaks. The Loop generates code, runs pytest automatically, feeds failures back for targeted fixes, and repeats until all tests pass. Then an independent Arbiter (always a different model than the generator) reviews the final output.

Latest benchmark across three tasks (simple CLI, REST API, async multi-agent system):

  • Single Sonnet: 94% avg, 10 min dev time, $0.36
  • Single o3: 81% avg, 4 min dev time, $0.44
  • Multi-model: 88% avg, 9 min dev time, $5.59
  • CRTX Loop: 99% avg, 2 min dev time, $1.80

"Dev time" estimates how long a developer would spend debugging the output before it's production-ready. The Loop's hardest prompt produced 127 passing tests with zero failures.

When the Loop hits a test it can't fix, it has a three-tier escalation: diagnose the root cause before patching, strip context to just the failing test and source file, then bring in a different model for a second opinion. The goal is zero dev time on every run.

Model-agnostic: works with Claude, GPT, o3, Gemini, Grok, DeepSeek. Bring your own API keys. Apache 2.0.

pip install crtx
https://github.com/CRTXAI/crtx

We published the benchmark tool too: run crtx benchmark --quick to reproduce our results with your own keys. Curious what scores people get on different providers and tasks.
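For readers who want a concrete picture of the generate→test→fix cycle and the escalation on stall, here is a minimal sketch. It is not CRTX's actual implementation: the names `run_loop`, `run_pytest`, `RunResult`, `LoopOutcome`, the tier labels, and the `attempts_per_tier` parameter are all hypothetical, and the `generate` and `fix` callables stand in for whatever model calls a real tool would make. The Arbiter review that follows a passing run is left out. The only real external behavior relied on is that `python -m pytest` exits with status 0 when the collected tests pass.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, List

# Illustrative labels (assumption): an ordinary targeted fix, followed by the
# three escalation stages the post describes.
TIERS = ("targeted_fix", "diagnose_root_cause", "minimal_context", "second_opinion")


@dataclass
class RunResult:
    passed: bool
    report: str  # test output that gets fed back to the model


@dataclass
class LoopOutcome:
    code: str
    passed: bool
    fixes: List[str] = field(default_factory=list)  # tier used for each fix attempt


def run_pytest(code: str, tests: str) -> RunResult:
    """Write code and tests to a temporary directory and run pytest on it."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(code)
        Path(tmp, "test_solution.py").write_text(tests)
        proc = subprocess.run(
            [sys.executable, "-m", "pytest", "-q", tmp],
            capture_output=True,
            text=True,
        )
    # pytest returns exit status 0 only when the collected tests all pass.
    return RunResult(proc.returncode == 0, proc.stdout + proc.stderr)


def run_loop(
    generate: Callable[[], str],
    fix: Callable[[str, str, str], str],  # (code, failure report, tier) -> new code
    run_tests: Callable[[str], RunResult],
    attempts_per_tier: int = 2,
) -> LoopOutcome:
    """Generate once, then test and fix until tests pass or every tier is exhausted."""
    code = generate()
    fixes: List[str] = []
    result = run_tests(code)
    for tier in TIERS:
        for _ in range(attempts_per_tier):
            if result.passed:
                return LoopOutcome(code, True, fixes)
            fixes.append(tier)
            code = fix(code, result.report, tier)
            result = run_tests(code)
    return LoopOutcome(code, result.passed, fixes)
```

In use, `run_tests` could be something like `lambda code: run_pytest(code, my_tests)`, with `generate` and `fix` wrapping calls to whichever provider you have keys for. Keeping the test runner injectable means the control flow can be exercised with fakes, without spending API credits.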

Similar Projects