OctopusGarden – An autonomous software factory (specs in, code out)

by foundatron·Mar 3, 2026·8 points·5 comments

AI Analysis

● Mid · Bold Bet · Ship It

Orchestrates AI agents to iterate code until tests pass—but StrongDM already ships this.

Strengths

• Holdout scenario validation with probabilistic LLM scoring prevents naive reward hacking.
• Clean feedback loop architecture: spec → generate → build → validate → iterate.
• Open-source reference for a production pattern (StrongDM's) most haven't seen.

Weaknesses

• StrongDM already has a production system doing this exact thing with proven results.
• Entirely dependent on API-based LLMs (Claude, GPT); no local or cost-aware fallback shown.
• Weekend-project status: unclear code maturity, test coverage, or real-world convergence rates.

Post Description

I built this over the weekend after reading about StrongDM's software factory (their writeup: https://factory.strongdm.ai/, Simon Willison's deep dive: https://simonwillison.net/2026/Feb/7/software-factory/, Dan Shapiro's Five Levels: https://www.danshapiro.com/blog/2026/01/the-five-levels-from...). OctopusGarden is an open-source implementation of the pattern StrongDM described: holdout scenarios, probabilistic satisfaction scoring via LLM-as-judge, and a convergence loop that iterates until the code works; no human code review in the loop.
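To make "probabilistic satisfaction scoring via LLM-as-judge" concrete, here is a minimal sketch of one way such a score could be aggregated. It is not OctopusGarden's actual code: `Scenario`, `satisfaction`, `fake_observe`, and `fake_judge` are hypothetical names, and the judge is a deterministic stub standing in for a real LLM call so the example runs offline.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class Scenario:
    """A holdout scenario: never shown to the code-generating agent."""
    name: str
    expectation: str


# Hypothetical signatures. A real judge would prompt an LLM and parse a score.
Observe = Callable[[Scenario], str]       # run the scenario, return what happened
Judge = Callable[[Scenario, str], float]  # score the observation in [0, 1]


def satisfaction(scenarios: Sequence[Scenario], observe: Observe,
                 judge: Judge, samples: int = 3) -> float:
    """Average judge score over several samples per scenario, then over scenarios."""
    if not scenarios or samples < 1:
        raise ValueError("need at least one scenario and one sample")
    per_scenario = []
    for scenario in scenarios:
        # Sample the judge more than once because an LLM judge is not deterministic.
        scores = [judge(scenario, observe(scenario)) for _ in range(samples)]
        # Clamp so a misbehaving judge cannot push the average out of range.
        per_scenario.append(mean([min(1.0, max(0.0, s)) for s in scores]))
    return mean(per_scenario)


# Stubs for illustration only: canned responses and a keyword-matching "judge".
def fake_observe(scenario: Scenario) -> str:
    responses = {
        "create item": "POST /items -> 201",
        "delete item": "DELETE /items/1 -> 500",
    }
    return responses[scenario.name]


def fake_judge(scenario: Scenario, observed: str) -> float:
    return 1.0 if scenario.expectation in observed else 0.25


scenarios = [Scenario("create item", "201"), Scenario("delete item", "204")]
score = satisfaction(scenarios, fake_observe, fake_judge)
print(f"satisfaction = {score}")
```

Averaging rather than requiring every scenario to pass is what makes the score probabilistic: the loop can then be driven by a threshold instead of a binary pass/fail.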

What stood out to me was that this architecture largely rhymes with the workflows that I and others already run with coding agents. It basically automates the connective tissue between the steps I was already doing in Claude Code, and then brute-forces a result. In the dark factory model, a spec goes in, code gets generated, built in Docker, validated against scenarios the agent never saw, and scored, and failures feed back until it converges.
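The loop described above can be sketched in a few lines. This is an illustration under assumptions, not the project's implementation: `converge`, `Attempt`, and the three callables are hypothetical names, and the stubs below replace the real LLM call, Docker build, and scenario validation so the control flow can be run as-is.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Attempt:
    iteration: int
    code: str
    built: bool
    score: float
    feedback: str


# Hypothetical signatures for the three stages of the loop.
Generate = Callable[[str, str], str]        # (spec, feedback) -> code
Build = Callable[[str], Tuple[bool, str]]   # code -> (ok, build log)
Validate = Callable[[str], Tuple[float, str]]  # code -> (score, feedback)


def converge(spec: str, generate: Generate, build: Build, validate: Validate,
             threshold: float = 0.9, max_iterations: int = 5) -> List[Attempt]:
    """Iterate generate -> build -> validate until the score clears the threshold."""
    history: List[Attempt] = []
    feedback = ""
    for iteration in range(1, max_iterations + 1):
        code = generate(spec, feedback)
        built, build_log = build(code)
        if built:
            score, feedback = validate(code)
        else:
            # A failed build never reaches validation; the log becomes the feedback.
            score, feedback = 0.0, f"build failed: {build_log}"
        history.append(Attempt(iteration, code, built, score, feedback))
        if built and score >= threshold:
            break
    return history


# Stubs for illustration only: each round of feedback "fixes" one problem.
def fake_generate(spec: str, feedback: str) -> str:
    if not feedback:
        return "broken"
    return "partial" if "build failed" in feedback else "complete"


def fake_build(code: str) -> Tuple[bool, str]:
    return (code != "broken", "syntax error" if code == "broken" else "ok")


def fake_validate(code: str) -> Tuple[float, str]:
    if code == "complete":
        return 1.0, "all scenarios satisfied"
    return 0.5, "scenario 'delete item' failed"


history = converge("CRUD API for items", fake_generate, fake_build, fake_validate)
for attempt in history:
    print(attempt.iteration, attempt.built, attempt.score, attempt.feedback)
```

The `max_iterations` cap is an assumption worth keeping in any real version: without it, a spec the model cannot satisfy would loop (and bill) indefinitely. How much scenario detail the feedback may reveal to the agent without spoiling the holdout is a separate design decision that this sketch does not settle.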

I've mostly tried it with standard CRUD/REST API apps, and it works. I haven't tried anything with HTML/JS yet. You can try the sample specs in the repo.

Some raw notes from the experience:

1. I don't want to maintain the code these factories generate. It works. The phenotype is (largely) correct, but the genotype is pretty wild and messy. I did not use OctopusGarden to build OctopusGarden (you can tell because it uses strict linting and tests). I know the point of these systems is zero human in the loop, but I think there's a real opportunity to get factories to generate code that humans actually want to maintain. I'm going to work on getting OctopusGarden there.

2. Compliance might be a nightmare. In my day job I think a lot about ISO 27001 and SOC 2 compliance. The idea of deploying dark-factory-generated projects into my environments and checking compliance boxes sounds painful. That might just be the current state of OctopusGarden and the code it generates, but I think we can get to a point where generated code is completely linted, statically checked, and tested inside the factory. That's not OctopusGarden today, but maybe it will be there next week? I can see this moving fast.

3. These dark factory apps will be hard to debug. There was a Claude outage today and I couldn't run my smoke tests or generate new apps. I don't want to maintain services that can't be debugged and fixed by a human in a pinch. We're already partially there with AI-assisted code, but this factory-generated code is even more convoluted. Requiring AI to create a new app version is probably worth it, but it's still one more thing between you and quickly patching an urgent bug.

4. Security needs a better story. These things need real security hardening. Maybe that's just better spec files and scenarios, maybe it's something more. I'm going to drink a strong cola and think about this one.

5. The unit of responsibility keeps growing. Last year we said code must come in PR-sized bites — that's how we manage risk. Now we're talking about deploying meshes of services created and deployed with no humans in the loop (except at creation). AI-generated services could really push the scale of what people are willing to accept responsibility for. Most SRE teams manage 1-5 services at big companies. Will that number increase per team? How much GDP is one person willing to manage via agents? Just a shower thought.

6. I was surprised this works. I'm surprised at how easy it was to make. I'm surprised more of these aren't out there already. I only did a couple of GitHub searches and didn't find many. I'm bad at searching. Sorry if I didn't find your project.
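Note 2 above imagines generated code being linted, statically checked, and tested inside the factory. A minimal sketch of what such a gate step might look like follows. None of this exists in OctopusGarden today according to the post; `Gate`, `run_gates`, and `failure_report` are hypothetical names, and the demo commands are trivial Python one-liners standing in for a real linter, type checker, and test runner.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Gate:
    name: str
    command: Sequence[str]  # argv list, executed without a shell


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool
    output: str


def run_gates(gates: Sequence[Gate], workdir: str = ".") -> List[GateResult]:
    """Run every gate in the generated project's directory and collect results."""
    results = []
    for gate in gates:
        proc = subprocess.run(list(gate.command), cwd=workdir,
                              capture_output=True, text=True)
        output = (proc.stdout + proc.stderr).strip()
        # Exit code 0 is treated as a pass, the usual convention for these tools.
        results.append(GateResult(gate.name, proc.returncode == 0, output))
    return results


def failure_report(results: Sequence[GateResult]) -> str:
    """Condense failing gates into text that could be fed back to the generator."""
    return "\n".join(f"[{r.name}] {r.output}" for r in results if not r.passed)


# Placeholder commands for illustration; a real factory would invoke actual tools.
demo_gates = [
    Gate("lint", [sys.executable, "-c", "print('lint clean')"]),
    Gate("tests", [sys.executable, "-c",
                   "import sys; print('1 test failed'); sys.exit(1)"]),
]
results = run_gates(demo_gates)
print(failure_report(results))
```

Running all gates instead of stopping at the first failure gives the next generation attempt the full list of problems in one pass, at the cost of some wasted work when an early gate fails badly.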

Similar Projects

AI/ML · ●● Solid

The Rouge is my attempt at an AI product factory

Runs Claude Code with --dangerously-skip-permissions to ship MVPs overnight.

Bold Bet · Big Brain

gr3gario

311mo ago

Developer Tools · ●● Solid

Turn any OpenAPI spec into agent-callable skills

It extracts focused, executable operations from giant OpenAPI files (the GitHub REST YAML is shown) to shrink context and avoid sidecar adapter sprawl — a pragmatic answer to token bloat and brittle ad-hoc integrations. Useful and concrete: if it actually generates tidy, updateable skill units and runtime hooks it saves a lot of maintenance. That said, the idea competes with existing LangChain/openai-function patterns; the repo will need clear runtime, versioning, and update strategies to feel like more than a nicer converter.

Solve My Problem · Niche Gem

yz-yu

103mo ago

AI/ML · ● Mid