We built Cobalt, Open source unit testing for AI Agents
Testing framework for AI agents with LLM judges and SQLite result tracking.
Independent framework to test, benchmark, and evaluate LLMs & AI agents locally.
Tests tool calls and trace quality when LangSmith only checks output strings.
LLM agent developers, AI engineering teams
LangSmith · Arize Phoenix · Braintrust
Testing framework for AI agents with LLM judges and SQLite result tracking.
Proves text safety ≠ tool-call safety; catches hidden harmful executions deterministically.
AST-based validation for function calling tests, but BFCL already covers this ground.
VCR for LLM calls—eliminates API costs and non-determinism in agent testing.
pytest-native testing for AI agents with 101 built-in safety attack probes.
Jest for LLMs—CI-native eval that fails builds on quality drops, not dashboards.