GitHub Repository

Spec-driven multi-agent orchestration — autonomous development workforce powered by Claude & OpenHands

63 stars · Python

OmoiOS–190K lines of Python to stop babysitting AI agents (Apache 2.0)

Author: kanddle

by kanddle·Mar 5, 2026·2 points·2 comments


AI Analysis

●●● Banger · Wizardry · Bold Bet

DAG-based agent swarms with spec generation from codebase beat prompt chaining, but long-term reliability unproven.

Strengths

•Core insight: agents fail because nobody verifies output satisfies the goal; automated spec-driven verification is genuinely clever
•Dependency graphs + sandbox isolation + supervisor agent reduces manual review burden vs naive agent loops
•190K LOC with FastAPI, PostgreSQL, Next.js implies real production infrastructure, not toy

Weaknesses

•No public case studies or uptime metrics; GitHub shows 30 stars and minimal community
•Risk: 'autonomous development' remains aspirational; agent reliability at scale is still an open problem

Post Description

AI coding agents generate decent code. The problem is everything around the code - checking progress, catching drift, deciding if it's actually done. I spent months trying to make autonomous agents work. The bottleneck was always me.

Attempt 1 - Claude/GPT directly: works for small stuff, but you re-explain context endlessly.

Attempt 2 - Copilot/Cursor: great autocomplete, still doing 95% of the thinking.

Attempt 3 - continuous agents: keeps working without prompting, but "no errors" doesn't mean "feature works."

Attempt 4 - parallel agents: faster wall-clock, but now you're manually reviewing even more output.

The common failure: nobody verifies whether the output satisfies the goal. That somebody was always me. So I automated that job.

OmoiOS is a spec-driven orchestration system. You describe a feature, and it:

1. Runs a multi-phase spec pipeline (Explore > Requirements > Design > Tasks) with LLM evaluators scoring each phase. Retry on failure, advance on pass. By the time agents code, requirements have machine-checkable acceptance criteria.

2. Spawns isolated cloud sandboxes per task. Your local env is untouched. Agents get ephemeral containers with full git access.

3. Validates continuously - a separate validator agent checks each task against acceptance criteria. Failures feed back for retry. No human in the loop between steps.

4. Discovers new work - validation can spawn new tasks when agents find missing edge cases. The task graph grows as agents learn.
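The gating in step 1 (retry on failure, advance on pass) can be sketched roughly as follows. This is an illustrative sketch, not OmoiOS's actual code: the `generate` and `evaluate` callables, the 0.8 pass threshold, and the attempt limit are all assumptions standing in for the LLM calls and evaluator settings the post does not spell out.

```python
from dataclasses import dataclass
from typing import Callable

# Phase names come from the post; everything else here is hypothetical.
PHASES = ["Explore", "Requirements", "Design", "Tasks"]


@dataclass
class PhaseResult:
    phase: str
    output: str
    score: float
    attempts: int


def run_spec_pipeline(
    feature: str,
    generate: Callable[[str, str, str], str],  # (phase, context, feedback) -> draft
    evaluate: Callable[[str, str], float],     # (phase, draft) -> score in [0, 1]
    pass_score: float = 0.8,                   # assumed threshold
    max_attempts: int = 3,                     # assumed retry budget
) -> list[PhaseResult]:
    """Run each phase in order; retry a phase until its evaluator score passes."""
    results: list[PhaseResult] = []
    context = feature
    for phase in PHASES:
        feedback = ""
        for attempt in range(1, max_attempts + 1):
            output = generate(phase, context, feedback)
            score = evaluate(phase, output)
            if score >= pass_score:
                results.append(PhaseResult(phase, output, score, attempt))
                context = output  # the next phase builds on the accepted output
                break
            # Feed the failing score back into the next attempt.
            feedback = f"score {score:.2f} is below the pass mark {pass_score:.2f}"
        else:
            raise RuntimeError(f"{phase} did not pass after {max_attempts} attempts")
    return results
```

Passing the generator and evaluator in as plain callables keeps the control flow testable with stubs, independent of any particular model or SDK.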

What's hard (honest):

- Spec quality is the bottleneck. Vague spec = agents spinning.
- Validation is domain-specific. API correctness is easy. UI quality is not.
- Discovery branching can grow the task graph unexpectedly.
- Sandbox overhead adds latency per task. Worth it, but a tradeoff.
- Merging parallel branches with real conflicts is the hardest problem.
- Guardian monitoring (per-agent trajectory analysis) has rough edges still.
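To make the validate, retry, and discover loop concrete, here is a minimal sketch of how such a loop might look, including one possible guard against unbounded discovery. The `Task` and `Verdict` types, the `execute` and `validate` callables, and the `max_discovered` cap are hypothetical; the post does not say how OmoiOS bounds graph growth, or whether it does.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    name: str
    criteria: list[str]
    attempts: int = 0


@dataclass
class Verdict:
    passed: bool
    discovered: list[Task] = field(default_factory=list)  # new work found by the validator


def run_tasks(
    initial: list[Task],
    execute: Callable[[Task], str],            # stand-in for an agent working in a sandbox
    validate: Callable[[Task, str], Verdict],  # stand-in for the validator agent
    max_retries: int = 2,                      # assumed: retries allowed after the first attempt
    max_discovered: int = 10,                  # assumed cap on validator-spawned tasks
) -> tuple[list[str], list[str]]:
    """Return (passed task names, failed task names) in completion order."""
    queue = deque(initial)
    done: list[str] = []
    failed: list[str] = []
    discovered_count = 0
    while queue:
        task = queue.popleft()
        task.attempts += 1
        verdict = validate(task, execute(task))
        # Accept discovered tasks only while under the cap.
        for new_task in verdict.discovered:
            if discovered_count < max_discovered:
                queue.append(new_task)
                discovered_count += 1
        if verdict.passed:
            done.append(task.name)
        elif task.attempts <= max_retries:
            queue.append(task)  # feed the failure back for another attempt
        else:
            failed.append(task.name)
    return done, failed
```

A hard cap is the bluntest possible control. It silently drops discoveries once the budget is spent, so a real system would probably log or escalate them instead.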

Stack: Python/FastAPI, PostgreSQL+pgvector, Redis (~190K lines). Next.js 15 + React Flow (~83K lines TS). Claude Agent SDK + Daytona Cloud. 686 commits since Nov 2025, built solo. Apache 2.0.

I keep coming back to the same problem: structured spec generation that produces genuinely machine-checkable acceptance criteria. Has anyone found an approach that works for non-trivial features, or is this just fundamentally hard?
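For concreteness, one way to read "machine-checkable" is that each criterion carries an executable predicate alongside its prose description. The sketch below is only an illustration of that idea: the `Criterion` type, the `AC-1`/`AC-2` ids, and the `/login` endpoint are invented for the example and do not come from OmoiOS.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Criterion:
    id: str
    description: str                        # human-readable requirement
    check: Callable[[dict[str, Any]], bool] # predicate over an observed result


def evaluate_criteria(
    criteria: list[Criterion], observation: dict[str, Any]
) -> dict[str, bool]:
    """Map each criterion id to whether the observation satisfies it."""
    return {c.id: bool(c.check(observation)) for c in criteria}


# Hypothetical criteria for a hypothetical login endpoint.
criteria = [
    Criterion(
        "AC-1",
        "POST /login with valid credentials returns 200",
        lambda o: o.get("status") == 200,
    ),
    Criterion(
        "AC-2",
        "Response body contains a non-empty token",
        lambda o: bool(o.get("body", {}).get("token")),
    ),
]

report = evaluate_criteria(criteria, {"status": 200, "body": {"token": "abc"}})
```

This works for the API-style checks the post calls easy. It says nothing about how to derive such predicates for UI quality or other fuzzy requirements, which is the hard part the question is asking about.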

GitHub: https://github.com/kivo360/OmoiOS
Live: https://omoios.dev