Time-travel debugging and side-by-side diffs for AI agents
Replay, fork, diff, and evaluate agent traces locally, like Git for agent behavior. Fills a real gap.
Postman for AI - design, evaluate, and debug LLM interactions with full transparency.
Local-first Postman for AI agents when LangSmith requires cloud accounts.
Developers building and debugging LLM-powered applications
LangSmith · Braintrust · Promptfoo
The workflow usually ends up being: write some code, run it, tweak a prompt, add logs just to understand what actually happened. It works in some cases, breaks in others, and it’s hard to see why. You also want to know that changing a prompt or model didn’t quietly break everything.
Reticle puts the whole loop in one place.
You define a scenario (prompt + variables + tools), run it against different models, and see exactly what happened: prompts, responses, tool calls, and results. You can then run evals against a dataset to check whether a change to the prompt or model breaks anything.
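To make that loop concrete, here is a minimal sketch of a scenario, a multi-model run, and a dataset eval. It is not Reticle's actual API: `Scenario`, `run_eval`, and the dataset shape are hypothetical names invented for illustration, and the two "models" are stub functions standing in for real provider calls.

```python
from dataclasses import dataclass, field
from string import Template
from typing import Callable, Optional

# Hypothetical shape for illustration only; not Reticle's real schema.
@dataclass
class Scenario:
    prompt: str                                   # template with $placeholders
    variables: dict = field(default_factory=dict)  # default variable values
    tools: list = field(default_factory=list)      # tool names (unused by the stubs)

    def render(self, overrides: Optional[dict] = None) -> str:
        # Dataset rows can override the scenario's default variables.
        values = {**self.variables, **(overrides or {})}
        return Template(self.prompt).substitute(values)

# A "model" here is any callable from prompt text to response text.
Model = Callable[[str], str]

def stub_echo(prompt: str) -> str:
    return prompt

def stub_upper(prompt: str) -> str:
    return prompt.upper()

def run_eval(scenario: Scenario, models: dict, dataset: list) -> dict:
    """Per model, the fraction of rows whose response contains the expected substring."""
    scores = {}
    for name, model in models.items():
        passed = 0
        for row in dataset:
            response = model(scenario.render(row["variables"]))
            if row["expected"] in response:
                passed += 1
        scores[name] = passed / len(dataset)
    return scores

scenario = Scenario(prompt="Greet $name politely.", variables={"name": "world"})
models = {"stub-upper": stub_upper, "stub-echo": stub_echo}
dataset = [
    {"variables": {"name": "Ada"}, "expected": "Ada"},
    {"variables": {"name": "Linus"}, "expected": "Linus"},
]
scores = run_eval(scenario, models, dataset)
print(scores)
```

Swapping one stub for another plays the role of changing the model: the case-sensitive substring check passes for the echo stub and fails for the upper-casing one, which is the kind of quiet breakage an eval run is meant to surface.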
There’s also a step-by-step view for agent runs so you can see why it made a decision. Everything runs locally. Prompts, API keys, and run history stay on your machine (SQLite).
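A step-by-step view over local history could be backed by something like the sketch below. The table layout, step kinds, and helper names are assumptions made for illustration; the post only says that run history lives in SQLite. It uses Python's standard `sqlite3` module with an in-memory database, where a real app would point at a file on disk.

```python
import json
import sqlite3

# Hypothetical schema; the actual tables Reticle uses are not described in the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE steps (
           run_id  TEXT    NOT NULL,
           seq     INTEGER NOT NULL,
           kind    TEXT    NOT NULL,  -- 'prompt', 'tool_call', 'tool_result', 'response'
           payload TEXT    NOT NULL,  -- JSON-encoded details of the step
           PRIMARY KEY (run_id, seq)
       )"""
)

def record_step(run_id: str, seq: int, kind: str, payload: dict) -> None:
    # Store each step as one row so a run can be replayed in order later.
    conn.execute(
        "INSERT INTO steps VALUES (?, ?, ?, ?)",
        (run_id, seq, kind, json.dumps(payload)),
    )

def replay(run_id: str) -> list:
    # Return the steps of one run in the order they happened.
    rows = conn.execute(
        "SELECT seq, kind, payload FROM steps WHERE run_id = ? ORDER BY seq",
        (run_id,),
    )
    return [(seq, kind, json.loads(payload)) for seq, kind, payload in rows]

record_step("run-1", 0, "prompt", {"text": "What is 2 + 2?"})
record_step("run-1", 1, "tool_call", {"tool": "calculator", "args": {"expr": "2 + 2"}})
record_step("run-1", 2, "tool_result", {"value": 4})
record_step("run-1", 3, "response", {"text": "4"})

for seq, kind, payload in replay("run-1"):
    print(seq, kind, payload)
```

Keeping one row per step means the "why did it do that" question becomes a simple ordered read, and nothing leaves the machine.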
Stack: Tauri + React + SQLite + Axum + Deno.
Still early and definitely rough around the edges. Is this roughly how people are debugging LLM workflows today, or do you do it differently?
Mitmproxy integration shows raw HTTP when LangSmith only shows parsed traces.
Academic methodology doc, not a working tool — agent frameworks already do this loop.
Tests tool calls and trace quality when LangSmith only checks output strings.
Machine-parseable traces for LLM agents when pdb and breakpoint() are useless.
Unsupervised bug benchmark using agents as both attackers and defenders—novel scoring methodology.