Attest – Test AI agents with 8-layer graduated assertions
8-layer assertion pipeline cuts LLM-judge calls by ~80%—free layers handle deterministic checks first.
Open-source testing framework for AI agents. Deterministic assertions first, LLM-as-judge last.
60–70% of what makes an agent correct is fully deterministic: tool call schemas, execution order, cost budgets, content format. Routing all of this through an LLM judge is expensive, slow, and unnecessarily non-deterministic. Attest exhausts deterministic checks first and only escalates when necessary.
The 8 layers: schema validation → cost/perf constraints → trace structure (tool ordering, loop detection) → content validation → semantic similarity via local ONNX embeddings (no API key) → LLM-as-judge → simulation with fault injection → multi-agent trace tree evaluation.
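To make the "deterministic first, judge last" ordering concrete, here is a minimal sketch of a graduated pipeline that stops at the first failing layer. It is not Attest's engine code: `Layer`, `run_layers`, `fake_judge`, and the trace dictionary shape are hypothetical names invented for illustration, and only five of the eight layers are modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Layer:
    name: str
    check: Callable[[dict], bool]
    paid: bool = False  # True for layers that would call an external model


def run_layers(trace: dict, layers: List[Layer]) -> Tuple[bool, List[str]]:
    """Run layers in order and stop at the first failure.

    Returns (passed, names_of_layers_that_ran).
    """
    ran: List[str] = []
    for layer in layers:
        ran.append(layer.name)
        if not layer.check(trace):
            return False, ran
    return True, ran


def fake_judge(trace: dict) -> bool:
    # Stand-in for an LLM-as-judge call (hypothetical; always approves).
    return True


# Cheap deterministic layers come first; the paid layer comes last.
LAYERS = [
    Layer("schema", lambda t: isinstance(t.get("output"), str)),
    Layer("cost", lambda t: t.get("cost_usd", 0.0) < 0.05),
    Layer("trace", lambda t: t.get("tools") == ["lookup_user", "reset_password"]),
    Layer("content", lambda t: "temporary password" in t["output"]),
    Layer("judge", fake_judge, paid=True),
]

good = {
    "output": "Your temporary password is abc123.",
    "cost_usd": 0.005,
    "tools": ["lookup_user", "reset_password"],
}
bad = dict(good, tools=["reset_password"])  # wrong tool order/count

print(run_layers(good, LAYERS))
print(run_layers(bad, LAYERS))
```

In this sketch the `bad` trace fails at the trace-structure layer, so the paid judge layer never runs. That is the cost-saving behavior the post describes.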
Example:
    from attest import agent, expect
    from attest.trace import TraceBuilder

    @agent("support-agent")
    def support_agent(builder: TraceBuilder, user_message: str):
        builder.add_tool_call(name="lookup_user", args={"query": user_message}, result={...})
        builder.add_tool_call(name="reset_password", args={"user_id": "U-123"}, result={...})
        builder.set_metadata(total_tokens=150, cost_usd=0.005, latency_ms=1200)
        return {"message": "Your temporary password is abc123."}

    def test_support_agent(attest):
        result = support_agent(user_message="Reset my password")
        chain = (
            expect(result)
            .cost_under(0.05)
            .tools_called_in_order(["lookup_user", "reset_password"])
            .output_contains("temporary password")
            .output_similar_to("password has been reset", threshold=0.8)
        )
        attest.evaluate(chain)
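For intuition about what a similarity threshold such as `threshold=0.8` in the test above might mean, here is a small sketch of a thresholded embedding comparison. It assumes cosine similarity as the metric, which the post does not confirm. The three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine` and `similar` are hypothetical helpers, not Attest functions.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similar(a: Sequence[float], b: Sequence[float], threshold: float = 0.8) -> bool:
    """Pass when the similarity score meets or exceeds the threshold."""
    return cosine(a, b) >= threshold


# Toy "embeddings" (assumed values, for illustration only).
expected = [1.0, 0.0, 1.0]
close_output = [1.0, 0.2, 0.9]
unrelated_output = [0.0, 1.0, 0.0]

print(similar(expected, close_output))      # nearby direction, passes
print(similar(expected, unrelated_output))  # orthogonal, fails
```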
The .output_similar_to() call runs locally via ONNX Runtime, so no embeddings API key is required. Layers 1–5 are free or near-free. The LLM judge is only invoked for genuinely subjective quality assessment.
Architecture: a single Go binary engine (1.7ms cold start, <2ms for 100-step trace eval) with thin Python and TypeScript SDKs. All evaluation logic lives in the engine, so both SDKs produce identical assertion results. 11 adapters covering OpenAI, Anthropic, Gemini, Ollama, LangChain, Google ADK, LlamaIndex, CrewAI, and OpenTelemetry.
v0.4.0 adds continuous evaluation with σ-based drift detection, a plugin system, result history, and CLI scaffolding. The engine and Python SDK have been stable across four releases. The TypeScript SDK is newer: its API is stable, but it hasn't been battle-tested at scale yet.
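As a rough illustration of what σ-based drift detection can look like, the sketch below flags a new score when it falls more than k sample standard deviations from the mean of past scores. This is a generic z-score check, not Attest's implementation. The function name `is_drift`, the default `k=3.0`, and the baseline numbers are all assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def is_drift(history: Sequence[float], latest: float, k: float = 3.0) -> bool:
    """Return True if `latest` is more than k sample std devs from the history mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past scores identical: any change counts as drift.
        return latest != mu
    return abs(latest - mu) > k * sigma


# Hypothetical pass-rate history from earlier evaluation runs.
baseline = [0.91, 0.89, 0.90, 0.92, 0.88]

print(is_drift(baseline, 0.89))  # within normal variation
print(is_drift(baseline, 0.70))  # far outside the usual band
```

A larger k makes the check less sensitive. The right value depends on how noisy the underlying metric is.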
The simulation runtime is the part I'm most curious about feedback on. You can define persona-driven simulated users (friendly, confused, adversarial), inject faults (latency, errors, rate limits), and run your agent against all of them in a single test suite. Is this useful in practice for CI, or is it a solution looking for a problem?
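For readers unfamiliar with fault injection in agent tests, here is a minimal standalone sketch of the idea: wrap a tool so that it fails at a configurable rate, then check that the agent degrades gracefully. It does not use Attest's simulation API. `with_faults`, `RateLimitError`, `lookup_user`, and `toy_agent` are hypothetical names made up for this example.

```python
import random
from typing import Any, Callable


class RateLimitError(Exception):
    """Injected fault representing a rate-limited tool call."""


def with_faults(tool: Callable[..., Any], *, error_rate: float, seed: int) -> Callable[..., Any]:
    """Wrap `tool` so each call raises RateLimitError with probability `error_rate`."""
    rng = random.Random(seed)  # seeded so CI runs are reproducible

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        if rng.random() < error_rate:
            raise RateLimitError("injected rate limit")
        return tool(*args, **kwargs)

    return wrapped


def lookup_user(query: str) -> dict:
    # Stand-in for a real tool.
    return {"user_id": "U-123"}


def toy_agent(lookup: Callable[[str], dict]) -> str:
    """Toy agent: retry the lookup up to 3 times, then fall back to an apology."""
    for _ in range(3):
        try:
            user = lookup("Reset my password")
            return f"Found {user['user_id']}"
        except RateLimitError:
            continue
    return "Sorry, please try again later."


healthy = with_faults(lookup_user, error_rate=0.0, seed=1)
always_limited = with_faults(lookup_user, error_rate=1.0, seed=1)

print(toy_agent(healthy))
print(toy_agent(always_limited))
```

Seeding the random generator keeps the injected faults deterministic, which matters if a suite like this is meant to run in CI.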
Apache 2.0 licensed. No platform to self-host, no BSL, no infrastructure requirements.
GitHub: https://github.com/attest-framework/attest
Examples: https://github.com/attest-framework/attest-examples
Website: https://attest-framework.github.io/attest-website/
Install: pip install attest-ai / npm install @attest-ai/core