TrainForgeTester – deterministic scenario tests for AI agents
Replaces flaky LLM judges with strict Python equality checks for tool arguments.
Behavioral testing framework for AI agents
VCR cassettes for agent tool sequences—catches prompt regressions before deploy.
AI/LLM engineers, agent builders, ML ops teams
VCR.js (HTTP cassettes) · Pytest fixtures for mocking
So I built TracePact. Record a known-good run as a cassette, diff against new runs, get a clear report of what changed:
- read_file (seq 0) (removed) ~ bash.cmd: "npm test" -> "npm run build" Summary: 1 removed, 1 arg changed[BLOCK]
It classifies changes as block (structural) or warn (args only), so you can gate CI with --fail-on warn. You can filter noise with --ignore-keys timestamp and --ignore-tools read_file.
No API calls needed for replay/diff. Works with any LLM provider. Vitest integration for writing assertions on tool traces.
Replaces flaky LLM judges with strict Python equality checks for tool arguments.
CSV-based agent testing works but LangSmith already owns this evaluation workflow.
Automates post-deployment log diffing across Kubernetes and Datadog dashboards.
Behavioral safety testing reveals 45 regressions static analysis misses—guardrails provided.
Replay-to-test automation closes the loop better than LogRocket's manual workflows.
Finally, pytest for AI tool calls when evals only test intelligence.