τ³-Bench is out – can agents handle complex docs and live calls?

Author: victorbarres

by victorbarres · Mar 25, 2026 · 12 points · 1 comment


AI Analysis

Banger · Big Brain · Niche Gem

Tests agents on ~700 interconnected policy documents and noisy full-duplex voice calls, two settings beyond what AgentBench covers.

Strengths

• Tests reasoning over ~700 interconnected policy docs, not just single-context retrieval.
• Full-duplex voice evaluation includes realistic noise, accents, and interruptions.
• Verifiable task outcomes mean a model gets no credit for merely claiming it completed a task.

Weaknesses

• Tightly coupled to the customer service domain, so less useful for general coding agents.
• Requires significant setup to run locally against your own agent architecture.

Post Description

τ-Bench is an open benchmark for evaluating AI agents on grounded, multi-turn customer service tasks with verifiable outcomes. It's been great to see the community adopt it since launch — this is now the third iteration. With τ³-Bench, we're extending it to two new settings: knowledge-intensive retrieval and full-duplex voice.
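
To make "verifiable outcomes" concrete, here is a minimal sketch of one way such a check could work: score an episode by comparing the final environment state with an expected goal state, rather than trusting what the agent says it did. The function name, the dictionary layout, and the reservation fields are all hypothetical and are not taken from the τ-Bench codebase.

```python
from typing import Any


def score_episode(final_state: dict[str, Any], goal_state: dict[str, Any]) -> float:
    """Return 1.0 only if every goal field is present and matches exactly."""
    for key, expected in goal_state.items():
        if key not in final_state or final_state[key] != expected:
            return 0.0
    return 1.0


# Hypothetical episode: the agent reports success, but the state disagrees.
agent_claim = "Your reservation has been cancelled and refunded."
final_state = {"reservation_status": "cancelled", "refund_issued": False}
goal_state = {"reservation_status": "cancelled", "refund_issued": True}

reward = score_episode(final_state, goal_state)
print(reward)  # 0.0, because the refund was never issued
```

The point of the design is that `agent_claim` never enters the score: only the resulting state does.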

τ-Knowledge: agents must navigate ~700 interconnected policy documents to complete multi-step tasks. The best frontier model (GPT-5.2, high reasoning) hits ~25%. The surprising part: even when you hand the model the exact documents it needs, performance only reaches ~40%. We found that the bottleneck isn't retrieval; it's reasoning over complex, interlinked policies and executing the right actions in the right order.
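
A rough way to read those two numbers, assuming the rounded figures quoted here can be treated as simple pass rates (the variable names are illustrative only):

```python
# Approximate pass rates quoted in the post; both are rounded, so the split is rough.
with_retrieval = 0.25     # the agent has to find the relevant documents itself
with_oracle_docs = 0.40   # the agent is handed the exact documents it needs

# Share of tasks recovered when retrieval is made perfect.
retrieval_gap = with_oracle_docs - with_retrieval
# Share of tasks that still fail even with the right documents in hand.
remaining_gap = 1.0 - with_oracle_docs

print(f"recovered by perfect retrieval: {retrieval_gap:.0%}")
print(f"still failing with the right docs: {remaining_gap:.0%}")
```

Under that reading, perfect retrieval recovers about 15 points, while about 60 points of failures remain, which is consistent with the claim that reasoning and action ordering dominate.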

τ-Voice: same grounded tasks, but over live full-duplex voice with realistic audio: accents, background noise, interruptions, compressed phone lines. Voice agents score 31–51% in clean audio conditions and 26–38% in realistic ones. A consistent failure pattern across providers (OpenAI, Gemini, xAI): the agent mishears a name or email during authentication, and everything downstream fails.
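
The cascade described above can be sketched with a toy example. Everything here is hypothetical (the customer record, the function names, the task steps) and is not drawn from the benchmark; it only shows why an exact-match identity lookup turns one misheard character into a failed task.

```python
from typing import Optional

# Hypothetical customer records keyed by email; not benchmark data.
CUSTOMERS = {
    "dana.kowalczyk@example.com": {"name": "Dana Kowalczyk", "order": "A-1001"},
}


def authenticate(heard_email: str) -> Optional[dict]:
    """Exact-match lookup, so a single misheard character returns None."""
    return CUSTOMERS.get(heard_email.strip().lower())


def handle_call(heard_email: str) -> list[str]:
    """Run the task steps in order; each one depends on a verified customer."""
    customer = authenticate(heard_email)
    if customer is None:
        # The order lookup and refund below are never reached.
        return ["authentication failed"]
    return ["authenticated", f"looked up order {customer['order']}", "issued refund"]


clean = handle_call("dana.kowalczyk@example.com")
noisy = handle_call("dana.kowalchyk@example.com")  # one letter misheard
print(clean)
print(noisy)
```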

We also incorporated 75+ task fixes to the original airline, retail, and telecom domains — many based on community audits and PRs (including contributions from Amazon and Anthropic). We believe a benchmark is only as good as its maintenance, and we're grateful for the community's help improving it.

Code and leaderboard are open — we'd welcome community submissions and feedback.

Blog post (papers, code, leaderboard): https://sierra.ai/blog/bench-advancing-agent-benchmarking-to...