GitHub Repository

Open-source testing framework for AI agents. Deterministic assertions first, LLM-as-judge last.

3 starsPython

Attest – Test AI agents with 8-layer graduated assertions

Name: Attest – Test AI agents with 8-layer graduated assertions
Availability: InStock
Author: tommathews

by tommathews·Feb 23, 2026·1 point·0 comments

Visit Project View on HN

Post Description

I built Attest because every team I've seen building AI agents ends up writing the same ad-hoc pytest scaffolding — checking if the right tools were called, if cost stayed under budget, if the output made semantic sense. It works until the agent gets complex, then it collapses.

60–70% of what makes an agent correct is fully deterministic: tool call schemas, execution order, cost budgets, content format. Routing all of this through an LLM judge is expensive, slow, and unnecessarily non-deterministic. Attest exhausts deterministic checks first and only escalates when necessary.

The 8 layers: schema validation → cost/perf constraints → trace structure (tool ordering, loop detection) → content validation → semantic similarity via local ONNX embeddings (no API key) → LLM-as-judge → simulation with fault injection → multi-agent trace tree evaluation.

Example:

from attest import agent, expect from attest.trace import TraceBuilder

@agent("support-agent") def support_agent(builder: TraceBuilder, user_message: str): builder.add_tool_call(name="lookup_user", args={"query": user_message}, result={...}) builder.add_tool_call(name="reset_password", args={"user_id": "U-123"}, result={...}) builder.set_metadata(total_tokens=150, cost_usd=0.005, latency_ms=1200) return {"message": "Your temporary password is abc123."}

def test_support_agent(attest): result = support_agent(user_message="Reset my password") chain = ( expect(result) .cost_under(0.05) .tools_called_in_order(["lookup_user", "reset_password"]) .output_contains("temporary password") .output_similar_to("password has been reset", threshold=0.8) ) attest.evaluate(chain)

The .output_similar_to() call runs locally via ONNX Runtime — no embeddings API key required. Layers 1–5 are free or near-free. The LLM judge is only invoked for genuinely subjective quality assessment.

Architecture: single Go binary engine (1.7ms cold start, <2ms for 100-step trace eval) with thin Python and TypeScript SDKs. All evaluation logic lives in the engine — both SDKs produce identical assertion results. 11 adapters covering OpenAI, Anthropic, Gemini, Ollama, LangChain, Google ADK, LlamaIndex, CrewAI, and OpenTelemetry.

v0.4.0 adds continuous evaluation with σ-based drift detection, a plugin system, result history, and CLI scaffolding. The engine and Python SDK are stable across four releases. The TypeScript SDK is newer — API is stable, hasn't been battle-tested at scale yet.

The simulation runtime is the part I'm most curious about feedback on. You can define persona-driven simulated users (friendly, confused, adversarial), inject faults (latency, errors, rate limits), and run your agent against all of them in a single test suite. Is this useful in practice for CI, or is it a solution looking for a problem?

Apache 2.0 licensed. No platform to self-host, no BSL, no infrastructure requirements.

GitHub: https://github.com/attest-framework/attest Examples: https://github.com/attest-framework/attest-examples Website: https://attest-framework.github.io/attest-website/ Install: pip install attest-ai / npm install @attest-ai/core