GitHub Repository
69 stars · TypeScript

Cobalt – Unit tests for AI agents, like Jest but for LLMs

by fdefitte · Feb 20, 2026 · 3 points · 0 comments

AI Analysis

●●● Banger · Ship It · Solve My Problem · Big Brain

Jest for LLMs: CI-native evals that fail the build when quality drops, rather than reporting to dashboards.

Strengths
  • Solves real problem: eval tools force UI workflows instead of code-driven CI gates
  • Jest parallel with LLM testing is genuinely smart framing; lives in code, not dashboards
  • Integrates existing platforms (Langfuse, LangSmith, Braintrust) instead of siloing data
Weaknesses
  • Early stage (38 stars when analyzed, no maturity signals on GitHub); needs production usage stories
  • Crowded field: others already build LLM eval tooling (Evals framework, Braintrust SDK), so differentiation depends on DX polish
Target Audience

AI/LLM application developers building agents, teams using Langfuse or LangSmith for observability

Similar To

Braintrust Evals · LangSmith evaluators · Arize eval SDK

Post Description

Hey HN, I built Cobalt, an open-source testing framework for AI agents and LLM apps.

Most eval tools (Braintrust, Arize, LangSmith) want you to live in their UI. Dashboards, manual reviews, clicking through results. That's fine for exploration, but it doesn't catch regressions. We needed something that runs in CI like any other test suite, lives in code, and fails the build when quality drops.

npm install @basalt-ai/cobalt
npx cobalt init
npx cobalt run

Write experiments as code:

import { experiment, Dataset, Evaluator } from '@basalt-ai/cobalt'

const dataset = Dataset.fromLangfuse('support-tickets')

experiment('support-agent', dataset, async ({ item }) => {
  // myAgent is your own agent function, defined elsewhere
  const result = await myAgent(item.input)
  return { output: result }
}, {
  evaluators: [
    new Evaluator({
      name: 'Helpful',
      type: 'llm-judge',
      prompt: 'Is this response helpful and accurate? {{output}}'
    }),
    new Evaluator({
      name: 'No hallucination',
      type: 'llm-judge',
      prompt: 'Does this contain fabricated info? {{output}}'
    }),
  ]
})
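
For readers wondering what the {{output}} placeholder in the judge prompts implies, here is a minimal sketch of how such a template could be filled in before being sent to a judge model. This is an illustration only, not Cobalt's actual code: renderPrompt and TemplateVars are hypothetical names, and the real library may handle placeholders differently.

```typescript
// Hypothetical helper for illustration; not part of the @basalt-ai/cobalt API.
type TemplateVars = Record<string, string>;

// Replace each {{name}} placeholder with the matching value from vars.
// Placeholders with no matching value are left untouched so they stay visible.
function renderPrompt(template: string, vars: TemplateVars): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, key: string) => {
    const value = vars[key];
    return value === undefined ? match : value;
  });
}

// Example: fill the "Helpful" judge prompt with an agent response.
const judgePrompt = renderPrompt(
  'Is this response helpful and accurate? {{output}}',
  { output: 'You can reset your password from the Settings page.' }
);
```

Leaving unknown placeholders intact is a deliberate choice in this sketch: a typo such as {{ouput}} then shows up in the rendered prompt instead of silently becoming an empty string.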

`npx cobalt run --ci` exits with code 1 if thresholds are violated. The GitHub Action posts score tables on PRs and auto-compares against the base branch.
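
To make the exit-code behavior concrete, here is a rough sketch of what a threshold gate can look like in general. It is not Cobalt's implementation: the EvaluatorScore and Thresholds shapes, the 0..1 score range, and the use of a per-evaluator mean are all assumptions made for illustration.

```typescript
// Hypothetical shapes; Cobalt's real result and threshold formats may differ.
interface EvaluatorScore {
  evaluator: string; // e.g. 'Helpful'
  score: number;     // assumed to be normalized to the range 0..1
}

// Minimum acceptable mean score per evaluator name (an assumed convention).
type Thresholds = Record<string, number>;

// Average the scores for each evaluator across all dataset items.
function meanScores(scores: EvaluatorScore[]): Record<string, number> {
  const totals: Record<string, { total: number; count: number }> = {};
  for (const { evaluator, score } of scores) {
    const entry = totals[evaluator] ?? { total: 0, count: 0 };
    entry.total += score;
    entry.count += 1;
    totals[evaluator] = entry;
  }
  const means: Record<string, number> = {};
  for (const name of Object.keys(totals)) {
    const entry = totals[name];
    if (entry !== undefined) means[name] = entry.total / entry.count;
  }
  return means;
}

// Return 1 if any evaluator's mean falls below its threshold, otherwise 0.
function ciExitCode(scores: EvaluatorScore[], thresholds: Thresholds): number {
  const means = meanScores(scores);
  for (const name of Object.keys(thresholds)) {
    const minimum = thresholds[name] ?? 0;
    const mean = means[name];
    // A missing evaluator counts as a failure so a gate cannot be silently skipped.
    if (mean === undefined || mean < minimum) return 1;
  }
  return 0;
}
```

In a Node script the result would typically be handed to process.exit(...), which is what makes a CI job fail: any non-zero exit code marks the step as failed.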

The part I'm most excited about: Cobalt ships with a built-in MCP server, so you can drive it entirely from Claude Code. Just tell it "compare GPT 5.2 with 5.1 on my support agent" or "run my experiments, find the failing cases, and fix the prompt." It runs the experiments, diffs the results, and iterates on your code without you leaving the terminal. Turns eval from a chore into a conversation.

Pull datasets from Langfuse, LangSmith, Braintrust, or plain JSON/JSONL/CSV. Results stored locally in SQLite. No accounts, no dashboards, no vendor lock-in.
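
As an illustration of the plain-file option, here is a minimal sketch of turning JSONL text (one JSON object per line) into dataset items. The DatasetItem shape and the parseJsonl helper are hypothetical, chosen to mirror the item.input access in the experiment above; Cobalt's own loaders may expect different fields.

```typescript
// Hypothetical item shape; only `input` is required in this sketch.
interface DatasetItem {
  input: unknown;
  expectedOutput?: unknown;
}

// Parse JSONL text into items, skipping blank lines.
// Throws if a record is not a JSON object with an "input" field.
function parseJsonl(text: string): DatasetItem[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line, index) => {
      const parsed: unknown = JSON.parse(line);
      if (typeof parsed !== 'object' || parsed === null || !('input' in parsed)) {
        throw new Error(`Record ${index + 1} has no "input" field`);
      }
      return parsed as DatasetItem;
    });
}
```

Validating each record up front means a malformed dataset fails before any model calls are made, rather than partway through a run.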

Similar Projects