Capability-based compiler/runner for reproducible agent scenarios
Fuchsia-inspired capability model for agent benchmarks addresses a reproducibility problem that existing tools ignore.
AI researchers and engineers building multi-agent systems
Docker Compose · Kubernetes · Fuchsia Component Framework
Amber grew out of the RDI AgentX-AgentBeats benchmarking competition [1] where the general public was invited to submit agents. To ensure trustworthy results, we needed submissions to be reproducible and have clear provenance. Reproducibility motivates declarative specifications of benchmarks, and provenance motivates the ability to safely and efficiently run benchmarks on hosted hardware. Once you add support for multi-phase multi-agent benchmarks (like Werewolf), the design for Amber mostly falls right out.
Amber is inspired by the Fuchsia OS Component Framework. Its security model is that a component, such as an A2A agent or an MCP tool, only serves components that have explicitly been given a capability to use it. In the context of benchmarks, this means that an agent under test cannot reach into the evaluator, and that a tool can be revoked in a later phase of a benchmark.
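To make the capability idea concrete, here is a minimal Python sketch of "only serve callers that hold an explicit capability, and allow revocation between phases". It is not Amber's API: `CapabilityTable`, `CapabilityError`, `call`, and the component names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


class CapabilityError(PermissionError):
    """Raised when a caller has no capability for the target (hypothetical type)."""


@dataclass
class CapabilityTable:
    # Maps a caller name to the set of targets it may currently use.
    grants: dict[str, set[str]] = field(default_factory=dict)

    def grant(self, caller: str, target: str) -> None:
        self.grants.setdefault(caller, set()).add(target)

    def revoke(self, caller: str, target: str) -> None:
        # Revoking a capability that was never granted is a no-op.
        self.grants.get(caller, set()).discard(target)

    def check(self, caller: str, target: str) -> None:
        # Deny by default: nothing is reachable without an explicit grant.
        if target not in self.grants.get(caller, set()):
            raise CapabilityError(f"{caller} holds no capability for {target}")


def call(table: CapabilityTable, caller: str, target: str) -> str:
    """Stand-in for a guarded request from one component to another."""
    table.check(caller, target)
    return f"{caller} -> {target}: ok"


# Phase 1: the evaluator may talk to the agent, and the agent may use a tool.
table = CapabilityTable()
table.grant("evaluator", "agent")
table.grant("agent", "search_tool")
phase1_result = call(table, "agent", "search_tool")

# Phase 2: the tool is revoked, so the same call is now refused.
table.revoke("agent", "search_tool")
```

In this sketch, grants are one-directional: the evaluator can reach the agent, but the agent was never granted the evaluator, so a call in that direction raises `CapabilityError`.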
Amber is a combination of a compiler and a runtime system: the compiler turns manifests describing agents, tools, and how they connect to each other into a deterministic plan. The plan can be executed against different backends like Docker, K8s, KVM, or the host OS. The compiler injects runtime components necessary to enforce the capability model: sidecar routers that provide guarded connectivity between components, and backend controllers that allow components to create and destroy components at runtime.
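As a rough illustration of what "manifests in, deterministic plan out" can mean, the Python sketch below compiles a toy manifest into a plan whose content does not depend on the order in which components were declared. The manifest shape, the field names (`components`, `uses`, `image`, `sidecar`), and the functions `compile_plan` and `plan_digest` are assumptions made for this example; Amber's real manifest format and plan layout are not described in this post.

```python
import hashlib
import json


def compile_plan(manifest: dict) -> dict:
    """Turn a toy manifest into a plan with a stable ordering (hypothetical format)."""
    components = manifest["components"]
    plan = {"components": [], "routes": []}
    # Iterate in sorted order so declaration order cannot affect the plan.
    for name in sorted(components):
        spec = components[name]
        for target in sorted(spec.get("uses", [])):
            if target not in components:
                raise ValueError(f"{name} uses unknown component {target}")
            # Only explicitly declared connections become routes.
            plan["routes"].append({"from": name, "to": target})
        plan["components"].append({
            "name": name,
            "image": spec["image"],
            # Illustrative stand-in for an injected sidecar router.
            "sidecar": f"{name}-router",
        })
    return plan


def plan_digest(plan: dict) -> str:
    """Hash a canonical JSON encoding so equal plans get equal digests."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


manifest_a = {"components": {
    "evaluator": {"image": "eval:1", "uses": ["agent"]},
    "agent": {"image": "agent:1", "uses": ["tool"]},
    "tool": {"image": "tool:1"},
}}
# Same content, declared in a different order.
manifest_b = {"components": {
    "tool": {"image": "tool:1"},
    "agent": {"uses": ["tool"], "image": "agent:1"},
    "evaluator": {"uses": ["agent"], "image": "eval:1"},
}}
plan_a = compile_plan(manifest_a)
plan_b = compile_plan(manifest_b)
```

Comparing `plan_digest(plan_a)` with `plan_digest(plan_b)` gives a cheap way to check that two runs were planned identically, which is the property a reproducible benchmark needs before any backend gets involved.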
Amber started out with just static `docker compose`, but benchmarks like TerminalBench and OSWorld required the addition of dynamic components and VM-backed components. Then competition participants wanted an easier way to test locally that didn't involve repeatedly rebuilding Docker images, so Amber got native binary support and a one-liner `amber run` interface. The concepts borrowed from Fuchsia have held up so far. Right now I'm working on making Amber's observability traces available to the benchmark evaluator so that it can judge based on the path an agent took, rather than just the final answer.
Overall, the goal we set out to achieve was to make it easy to reproduce agent benchmark results in a low-trust environment. Amber is not a complete solution, but it takes some burden off of benchmark authors and agent builders. Maybe it's even useful beyond benchmarks. I would be happy for you to batter the conceptual framework!
The AgentBeats tau2 benchmark manifest [2] is a real example. The in-tree mixed-site example [3] is a simple demo of Amber end-to-end with `amber run`.
[0]: https://news.ycombinator.com/item?id=47733217
[1]: https://rdi.berkeley.edu/agentx-agentbeats.html
[2]: https://github.com/RDI-Foundation/tau2-agentbeats/blob/main/...
[3]: https://github.com/RDI-Foundation/amber/tree/main/examples/m...