We built Cobalt, open-source unit testing for AI agents
Testing framework for AI agents with LLM judges and SQLite result tracking.
Jest for LLMs: CI-native eval that fails the build when quality drops, instead of leaving regressions on a dashboard.
AI/LLM application developers building agents, teams using Langfuse or LangSmith for observability
Braintrust Evals · LangSmith evaluators · Arize eval SDK
Most eval tools (Braintrust, Arize, LangSmith) want you to live in their UI. Dashboards, manual reviews, clicking through results. That's fine for exploration, but it doesn't catch regressions. We needed something that runs in CI like any other test suite, lives in code, and fails the build when quality drops.
    npm install @basalt-ai/cobalt
    npx cobalt init
    npx cobalt run
Write experiments as code:

    import { experiment, Dataset, Evaluator } from '@basalt-ai/cobalt'

    const dataset = Dataset.fromLangfuse('support-tickets')

    experiment('support-agent', dataset, async ({ item }) => {
      const result = await myAgent(item.input)
      return { output: result }
    }, {
      evaluators: [
        new Evaluator({
          name: 'Helpful',
          type: 'llm-judge',
          prompt: 'Is this response helpful and accurate? {{output}}'
        }),
        new Evaluator({
          name: 'No hallucination',
          type: 'llm-judge',
          prompt: 'Does this contain fabricated info? {{output}}'
        }),
      ]
    })
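The `{{output}}` placeholder in the judge prompts implies some form of template substitution before the prompt reaches the judge model. The sketch below shows one way that step could work. `renderPrompt` and `TemplateVars` are hypothetical names made up for illustration and are not part of Cobalt's API; how Cobalt actually fills its templates may differ.

```typescript
// Hypothetical helper for illustration only; not Cobalt's implementation.
type TemplateVars = Record<string, string>;

function renderPrompt(template: string, vars: TemplateVars): string {
  // Replace each {{name}} with its value. Placeholders with no matching
  // variable are left as-is so a missing value is easy to spot.
  return template.replace(
    /\{\{\s*(\w+)\s*\}\}/g,
    (match: string, name: string): string => {
      const value = vars[name];
      return value !== undefined ? value : match;
    }
  );
}

// Example: fill the "Helpful" judge prompt with an agent response.
const judgePrompt = renderPrompt(
  'Is this response helpful and accurate? {{output}}',
  { output: 'You can reset your password from the account settings page.' }
);
```

Leaving unknown placeholders untouched, rather than replacing them with an empty string, is a design choice in this sketch: it makes a misspelled variable visible in the prompt instead of silently dropping it.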
`npx cobalt run --ci` exits with code 1 if thresholds are violated. The GitHub Action posts score tables on PRs and auto-compares against the base branch.

The part I'm most excited about: Cobalt ships with a built-in MCP server, so you can drive it entirely from Claude Code. Just tell it "compare GPT 5.2 with 5.1 on my support agent" or "run my experiments, find the failing cases, and fix the prompt." It runs the experiments, diffs the results, and iterates on your code without you leaving the terminal. Turns eval from a chore into a conversation.
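To make the `--ci` behaviour concrete, here is a minimal sketch of threshold gating: compare each evaluator's mean score against a minimum and map any violation to exit code 1. The `EvaluatorScore` and `Thresholds` shapes and both function names are assumptions for illustration; Cobalt's real configuration and result formats may look different.

```typescript
// Hypothetical shapes; Cobalt's actual result and config formats may differ.
interface EvaluatorScore {
  evaluator: string;
  meanScore: number; // assumed to be normalised to the range 0..1
}

type Thresholds = Record<string, number>;

// Return a human-readable message for every evaluator below its minimum.
function findViolations(scores: EvaluatorScore[], thresholds: Thresholds): string[] {
  const violations: string[] = [];
  for (const { evaluator, meanScore } of scores) {
    const min = thresholds[evaluator];
    // Evaluators without a configured threshold never fail the build.
    if (min !== undefined && meanScore < min) {
      violations.push(`${evaluator}: ${meanScore.toFixed(2)} < ${min.toFixed(2)}`);
    }
  }
  return violations;
}

// Map the result to a process exit code: 1 fails the CI job, 0 passes it.
function exitCodeFor(scores: EvaluatorScore[], thresholds: Thresholds): number {
  return findViolations(scores, thresholds).length > 0 ? 1 : 0;
}
```

A CI runner treats any non-zero exit code as a failed step, which is what lets a quality drop block a merge the same way a failing unit test would.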
Pull datasets from Langfuse, LangSmith, Braintrust, or plain JSON/JSONL/CSV. Results stored locally in SQLite. No accounts, no dashboards, no vendor lock-in.
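For the plain-file case, a JSONL dataset is one JSON object per line. The sketch below parses such a file into items with an `input` field, matching the `item.input` access in the experiment above. `DatasetItem`, the optional `expected` field, and `parseJsonl` are hypothetical names for illustration; Cobalt's own loaders and item schema may differ.

```typescript
// Hypothetical item shape; only `input` is grounded in the experiment example.
interface DatasetItem {
  input: unknown;
  expected?: unknown;
}

function parseJsonl(text: string): DatasetItem[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // skip blank lines
    .map((line, index) => {
      const parsed: unknown = JSON.parse(line);
      if (typeof parsed !== 'object' || parsed === null || !('input' in parsed)) {
        // index counts non-empty records, not raw line numbers
        throw new Error(`Record ${index + 1} has no "input" field`);
      }
      return parsed as DatasetItem;
    });
}

// Example: two support tickets, separated by a blank line.
const items = parseJsonl(
  '{"input": "How do I reset my password?"}\n\n{"input": "Cancel my plan", "expected": "cancellation steps"}\n'
);
```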
MCP starter kit with 30+ security tests, but it's a template—not a finished product or tool.
pytest-native testing for AI agents with 101 built-in safety attack probes.
Tests tool calls and trace quality when LangSmith only checks output strings.
Jest GUI with built-in coverage and snapshot updates when VS Code extensions already exist.
Agent-native eval workflow beats LangSmith's manual dashboard setup.