GitHub Repository

The agent eval standard for MCP — score output quality, catch safety failures, enforce cost budgets

7 stars · TypeScript

Iris – first MCP-native eval and observability tool for AI agents

Author: iparent

by iparent·Mar 14, 2026·1 point·0 comments

AI Analysis

●● Solid · Niche Gem · Ship It

MCP-native observability means zero code changes for compatible agents.

Strengths

•Three MCP tools (log_trace, evaluate_output, get_traces) integrate directly into agent workflows
•12 built-in eval rules across completeness, relevance, and other categories
•Production security with rate limiting, CORS, and input validation baked in

Weaknesses

•MCP adoption is still limited, restricting the potential user base
•LangSmith and Arize already dominate the AI observability space

Post Description

I kept running into the same problem building AI agents: once they're running, I have no idea what they're actually doing. Traditional monitoring shows me HTTP 200. It can't tell me the output was wrong, that the agent leaked a user's email address, or that a single tool call in the chain is burning through tokens.

So I built Iris. It's an open-source MCP server — not an SDK, not a proxy. Any MCP-compatible agent (Claude Desktop, Cursor, or anything built with the MCP SDK) discovers and uses it automatically. Add it to your MCP config and your agent gains observability without touching your code.

What it does:

- 3 MCP tools: log_trace (full execution traces with spans, tool calls, token usage, cost in USD), evaluate_output (score output quality against configurable rules), get_traces (query traces with filters and pagination)
- 12 built-in eval rules across 4 categories: completeness (output length, coverage), relevance (keyword overlap, hallucination markers), safety (PII detection for SSN/credit card/phone/email, prompt injection patterns, blocklist), and cost (USD threshold, token efficiency)
- Hierarchical span tree: trace exactly where in an agent's execution chain something went wrong: which tool call failed, which step was slow
- Aggregate cost tracking: the dashboard shows total spend across all your agents over any time window, not just per-trace cost. You can finally answer "what are my agents costing me?"
- Web dashboard: dark-mode React UI with summary cards, trace list, span tree view, eval results with per-rule breakdown
- SQLite storage: single file, no database server. Back it up, move it, inspect it with any SQLite tool
- Custom eval rules defined with Zod schemas
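To make the rule categories above more concrete, here is a minimal sketch of how a safety rule (PII detection) and a cost rule (USD threshold) could be expressed and run against one output. Everything in it is an assumption for illustration: the `EvalInput`, `RuleResult`, and `EvalRule` shapes, the rule names, and the deliberately simple regex patterns are hypothetical and are not taken from Iris's actual API or rule set.

```typescript
// Hypothetical shapes for illustration only; not Iris's actual types.
interface EvalInput {
  output: string;  // the agent's final text output
  costUsd: number; // total cost of the trace in USD
}

interface RuleResult {
  rule: string;
  passed: boolean;
  detail: string;
}

type EvalRule = (input: EvalInput) => RuleResult;

// Deliberately simple patterns for two of the PII types named in the post.
// No global flag, so RegExp.test() keeps no state between calls.
const PII_PATTERNS: Record<string, RegExp> = {
  email: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/,
  ssn: /\b\d{3}-\d{2}-\d{4}\b/,
};

// Safety rule: fails if any PII pattern matches the output.
const noPii: EvalRule = ({ output }) => {
  const hits = Object.entries(PII_PATTERNS)
    .filter(([, pattern]) => pattern.test(output))
    .map(([name]) => name);
  return {
    rule: "safety/no-pii",
    passed: hits.length === 0,
    detail: hits.length === 0 ? "no PII patterns matched" : `matched: ${hits.join(", ")}`,
  };
};

// Cost rule factory: fails if the trace cost exceeds the given budget.
const costUnder = (maxUsd: number): EvalRule => ({ costUsd }) => ({
  rule: "cost/usd-threshold",
  passed: costUsd <= maxUsd,
  detail: `cost ${costUsd.toFixed(4)} USD, limit ${maxUsd.toFixed(4)} USD`,
});

// Run every rule and keep the per-rule breakdown.
function evaluate(input: EvalInput, rules: EvalRule[]): RuleResult[] {
  return rules.map((rule) => rule(input));
}

const results = evaluate(
  { output: "Contact me at jane@example.com", costUsd: 0.02 },
  [noPii, costUnder(0.05)],
);
console.log(results);
```

In this sketch the first rule fails because the output contains an email address, while the cost rule passes because 0.02 USD is under the 0.05 USD budget. Keeping one result per rule mirrors the per-rule breakdown the dashboard is described as showing.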

Security: API key auth, rate limiting (express-rate-limit), helmet headers, CORS, input validation, ReDoS-safe regex for user-supplied patterns, 1MB body limit.

Stack: TypeScript, Express 5, better-sqlite3, @modelcontextprotocol/sdk, Zod, pino.

Iris also exposes MCP resources: your agent can programmatically read iris://dashboard/summary to get aggregate metrics without opening the dashboard. Every trace is logged in full, which also means you're building the audit trail that regulations like the EU AI Act will require by August 2026.
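As a rough illustration of what "aggregate metrics" over a time window can mean, the sketch below sums cost and token usage per agent from a list of trace records. The `TraceRecord` and `AgentSummary` shapes and the `summarize` function are hypothetical; the post does not describe the actual payload of iris://dashboard/summary or the storage schema, so treat this as one plausible way to compute such a summary rather than how Iris does it.

```typescript
// Hypothetical trace record; field names are assumptions, not Iris's schema.
interface TraceRecord {
  agent: string;
  timestamp: number; // Unix epoch milliseconds
  costUsd: number;
  totalTokens: number;
}

interface AgentSummary {
  traces: number;
  costUsd: number;
  totalTokens: number;
}

// Sum cost and tokens per agent for traces with fromMs <= timestamp < toMs.
function summarize(
  traces: TraceRecord[],
  fromMs: number,
  toMs: number,
): Map<string, AgentSummary> {
  const byAgent = new Map<string, AgentSummary>();
  for (const t of traces) {
    if (t.timestamp < fromMs || t.timestamp >= toMs) continue; // outside window
    const s = byAgent.get(t.agent) ?? { traces: 0, costUsd: 0, totalTokens: 0 };
    s.traces += 1;
    s.costUsd += t.costUsd;
    s.totalTokens += t.totalTokens;
    byAgent.set(t.agent, s);
  }
  return byAgent;
}

// Made-up sample data for the example.
const sample: TraceRecord[] = [
  { agent: "support-bot", timestamp: 1_000, costUsd: 0.25, totalTokens: 1200 },
  { agent: "support-bot", timestamp: 2_000, costUsd: 0.5, totalTokens: 800 },
  { agent: "research-bot", timestamp: 2_500, costUsd: 0.125, totalTokens: 400 },
  { agent: "research-bot", timestamp: 9_000, costUsd: 1, totalTokens: 5000 }, // outside window
];

const summary = summarize(sample, 0, 5_000);
for (const [agent, s] of summary) {
  console.log(agent, s);
}
```

The half-open window (`fromMs` inclusive, `toMs` exclusive) is a design choice in this sketch so that adjacent windows never count the same trace twice.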

npm install -g @iris-eval/mcp-server
iris-mcp --transport http --dashboard

Self-hosted, MIT licensed.

GitHub: https://github.com/iris-eval/mcp-server
npm: https://www.npmjs.com/package/@iris-eval/mcp-server

I'd appreciate feedback on two things specifically:

1. The eval rule system: are these the right 12 rules to ship with? What's missing?
2. The MCP tool API: three tools feels minimal but sufficient. Should trace logging and evaluation be combined or kept separate?

Check the roadmap for what's coming next: https://github.com/iris-eval/mcp-server/blob/main/docs/roadm...

Similar Projects

Developer Tools · ●● Solid

LLM Observability Stack for Local Dev – Agent Super Apy

Mitmproxy integration shows raw HTTP when LangSmith only shows parsed traces.

Ship It · Solve My Problem

simple10

203mo ago

AI/ML · ●● Solid

What Did My Agent Do? Compare logs to signed records

Separates debuggable logs from dispute-ready evidence—most tools conflate both.

Big Brain · Dark Horse

jithinraj

202mo ago

Developer Tools · ●● Solid

TruLayer – tracing, evals, and a control loop for production LLMs

Automated rollback on regression is a killer feature LangSmith doesn't have.

Solve My Problem · Slick

trulayer

2021d ago

Developer Tools · ●● Solid

Agent Breadcrumbs – Unified Work Log Across Claude, Codex, OpenClaw

Turns every agent client into a unified work log via a tiny MCP server and a single log_work call. Schema profiles and multiple sinks (jsonl, webhook, Postgres) let teams standardize payloads once and collect logs from Codex, ChatGPT, Claude, OpenClaw or cron jobs. Practical and low-friction — useful as an ops-level glue — but the dashboard is an in-repo MVP and delivery semantics (retries/schema evolution) will determine if it scales beyond a small team.

Niche Gem · Ship It

ejcho623

103mo ago

Developer Tools · ●● Solid

N8n-trace – Grafana-like observability for n8n workflows

Grafana for n8n workflows, but adoption depends on n8n's self-hosted user base.

Solve My Problem · Niche Gem

mj95

213mo ago

Developer Tools · ●● Solid

An agent skill for eval-driven development of LLM-powered app

Agent-native eval workflow beats LangSmith's manual dashboard setup.

Big Brain · Ship It

yol

103mo ago