LLM Observability Stack for Local Dev – Agent Super Apy
Mitmproxy integration shows raw HTTP when LangSmith only shows parsed traces.
The agent eval standard for MCP — score output quality, catch safety failures, enforce cost budgets
MCP-native observability means zero code changes for compatible agents.
Developers building AI agents with MCP
LangSmith · Arize · Helicone
So I built Iris. It's an open-source MCP server — not an SDK, not a proxy. Any MCP-compatible agent (Claude Desktop, Cursor, or anything built with the MCP SDK) discovers and uses it automatically. Add it to your MCP config and your agent gains observability without touching your code.
What it does:
- 3 MCP tools: log_trace (full execution traces with spans, tool calls, token usage, cost in USD), evaluate_output (score output quality against configurable rules), get_traces (query traces with filters and pagination)
- 12 built-in eval rules across 4 categories: completeness (output length, coverage), relevance (keyword overlap, hallucination markers), safety (PII detection for SSN/credit card/phone/email, prompt injection patterns, blocklist), and cost (USD threshold, token efficiency)
- Hierarchical span tree: trace exactly where in an agent's execution chain something went wrong, such as which tool call failed or which step was slow
- Aggregate cost tracking: the dashboard shows total agent spend across all your agents over any time window, not just per-trace cost. You can finally answer "what are my agents costing me?"
- Web dashboard: dark-mode React UI with summary cards, trace list, span tree view, eval results with per-rule breakdown
- SQLite storage: single file, no database server. Back it up, move it, inspect it with any SQLite tool
- Custom eval rules defined with Zod schemas
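To make the eval-rule idea concrete, here is a minimal sketch of how a safety rule and a cost rule could be expressed and combined into a score. This is not Iris's actual code: the type names (EvalInput, RuleResult, EvalRule), the rule names, the PII patterns, and the pass-fraction scoring are all assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration; not Iris's real types or rule names.
interface EvalInput {
  output: string;  // the agent's final text output
  costUsd: number; // total cost of the run in USD
}

interface RuleResult {
  rule: string;
  passed: boolean;
  detail: string;
}

type EvalRule = (input: EvalInput) => RuleResult;

// Assumed patterns: a US SSN shape and a simple email shape.
const PII_PATTERNS: Record<string, RegExp> = {
  ssn: /\b\d{3}-\d{2}-\d{4}\b/,
  email: /\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b/,
};

// Safety rule: fails when any PII pattern matches the output.
const noPii: EvalRule = ({ output }) => {
  const hits = Object.entries(PII_PATTERNS)
    .filter(([, pattern]) => pattern.test(output))
    .map(([name]) => name);
  return {
    rule: "no_pii",
    passed: hits.length === 0,
    detail: hits.length === 0 ? "no PII patterns matched" : `matched: ${hits.join(", ")}`,
  };
};

// Cost rule factory: fails when the run cost exceeds the given USD limit.
const maxCostUsd = (limit: number): EvalRule => ({ costUsd }) => ({
  rule: "max_cost_usd",
  passed: costUsd <= limit,
  detail: `cost ${costUsd.toFixed(4)} USD, limit ${limit.toFixed(4)} USD`,
});

// Runs every rule and reports the fraction that passed plus a per-rule breakdown.
function evaluate(input: EvalInput, rules: EvalRule[]): { score: number; results: RuleResult[] } {
  const results = rules.map((rule) => rule(input));
  const passedCount = results.filter((r) => r.passed).length;
  return { score: rules.length === 0 ? 1 : passedCount / rules.length, results };
}

// Usage: one rule fails (email found), one passes (under budget), so score is 0.5.
const example = evaluate(
  { output: "Reach me at jane@example.com", costUsd: 0.02 },
  [noPii, maxCostUsd(0.05)],
);
console.log(example.score, example.results.map((r) => `${r.rule}: ${r.detail}`));
```

Keeping each rule a plain function makes the per-rule breakdown fall out of the same pass that computes the score; a real rule set would likely weight rules rather than count them equally.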
Security: API key auth, rate limiting (express-rate-limit), helmet headers, CORS, input validation, ReDoS-safe regex for user-supplied patterns, 1MB body limit.
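As a rough illustration of what guarding user-supplied patterns can involve, the sketch below rejects overly long patterns and patterns with a quantified group that itself contains a quantifier before compiling them. It is a conservative heuristic under stated assumptions, not Iris's implementation: compileUserPattern, the 200-character limit, and the detection regex are all hypothetical, and the check neither catches every dangerous pattern nor avoids every false positive.

```typescript
// Hypothetical helper; the limit and heuristic are assumptions, not Iris's code.
const MAX_PATTERN_LENGTH = 200;

// Matches a group containing + or * that is itself followed by +, * or {,
// for example (a+)+ or (\d*)*. Escaped characters are not special-cased,
// so some harmless patterns such as (\+)+ are rejected as well.
const NESTED_QUANTIFIER = /\([^)]*[+*][^)]*\)[+*{]/;

interface PatternCheck {
  ok: boolean;
  regex?: RegExp;  // set only when ok is true
  reason?: string; // set only when ok is false
}

function compileUserPattern(source: string): PatternCheck {
  // The length cap runs first, which also bounds the cost of the heuristic itself.
  if (source.length > MAX_PATTERN_LENGTH) {
    return { ok: false, reason: "pattern too long" };
  }
  if (NESTED_QUANTIFIER.test(source)) {
    return { ok: false, reason: "nested quantifier" };
  }
  try {
    return { ok: true, regex: new RegExp(source) };
  } catch (_err) {
    return { ok: false, reason: "invalid regular expression" };
  }
}

// Usage: a classic catastrophic-backtracking shape is refused, a plain SSN pattern is accepted.
console.log(compileUserPattern("(a+)+$").reason);
console.log(compileUserPattern("^\\d{3}-\\d{2}-\\d{4}$").ok);
```

A stricter setup would pair a check like this with an input-length cap or a linear-time regex engine, since syntactic heuristics alone miss cases such as overlapping alternations.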
Stack: TypeScript, Express 5, better-sqlite3, @modelcontextprotocol/sdk, Zod, pino.
Iris also exposes MCP resources: your agent can programmatically read iris://dashboard/summary to get aggregate metrics without opening the dashboard. Every trace is stored in full, which also means you're building the audit trail that regulations like the EU AI Act will require by August 2026.
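For a sense of what an aggregate metric over stored traces might look like, here is a small sketch that totals cost per agent within a time window. The TraceRecord shape, its field names, and the totalCostByAgent helper are assumptions for illustration only; the source does not describe the actual trace schema or the payload returned by iris://dashboard/summary.

```typescript
// Hypothetical trace record; field names are assumptions, not Iris's schema.
interface TraceRecord {
  agent: string;     // which agent produced the trace
  timestamp: string; // ISO 8601 time the trace was logged
  costUsd: number;   // cost of this trace in USD
}

// Sums cost per agent for traces logged in the half-open window [from, to).
function totalCostByAgent(traces: TraceRecord[], from: Date, to: Date): Map<string, number> {
  const totals = new Map<string, number>();
  for (const trace of traces) {
    const at = new Date(trace.timestamp).getTime();
    if (at < from.getTime() || at >= to.getTime()) continue; // outside the window
    totals.set(trace.agent, (totals.get(trace.agent) ?? 0) + trace.costUsd);
  }
  return totals;
}

// Usage with made-up data: only the two January traces fall inside the window.
const sampleTraces: TraceRecord[] = [
  { agent: "support-bot", timestamp: "2025-01-10T09:00:00Z", costUsd: 0.25 },
  { agent: "support-bot", timestamp: "2025-01-11T09:00:00Z", costUsd: 0.5 },
  { agent: "support-bot", timestamp: "2025-02-01T09:00:00Z", costUsd: 4 },
];
const january = totalCostByAgent(
  sampleTraces,
  new Date("2025-01-01T00:00:00Z"),
  new Date("2025-02-01T00:00:00Z"),
);
console.log(january.get("support-bot"));
```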
npm install -g @iris-eval/mcp-server
iris-mcp --transport http --dashboard
Self-hosted, MIT licensed.
GitHub: https://github.com/iris-eval/mcp-server
npm: https://www.npmjs.com/package/@iris-eval/mcp-server
I'd appreciate feedback on two things specifically:
1. The eval rule system: are these the right 12 rules to ship with? What's missing?
2. The MCP tool API: three tools feels minimal but sufficient. Should trace logging and evaluation be combined or kept separate?
Check the roadmap for what's coming next: https://github.com/iris-eval/mcp-server/blob/main/docs/roadm...
Separates debuggable logs from dispute-ready evidence—most tools conflate both.
Automated rollback on regression is a killer feature LangSmith doesn't have.
Turns every agent client into a unified work log via a tiny MCP server and a single log_work call. Schema profiles and multiple sinks (jsonl, webhook, Postgres) let teams standardize payloads once and collect logs from Codex, ChatGPT, Claude, OpenClaw or cron jobs. Practical and low-friction — useful as an ops-level glue — but the dashboard is an in-repo MVP and delivery semantics (retries/schema evolution) will determine if it scales beyond a small team.
Grafana for n8n workflows, but adoption depends on n8n's self-hosted user base.
Agent-native eval workflow beats LangSmith's manual dashboard setup.