τ³-Bench is out: can agents handle complex docs and live calls?

by victorbarres·Mar 25, 2026·12 points·1 comment

AI Analysis

●●● Banger · Big Brain · Niche Gem

Tests agents on ~700 interconnected policy docs and noisy voice calls, picking up where AgentBench stops.

Strengths
  • Tests reasoning over 700 interconnected policy docs, not just single-context retrieval.
  • Full-duplex voice evaluation exposes agents to realistic noise, accents, and interruptions.
  • Verifiable task outcomes keep models from claiming success on tasks they did not complete.
Weaknesses
  • Tightly coupled to the customer service domain, so less useful for general coding agents.
  • Requires significant setup to run locally against your own agent architecture.
Target Audience

AI researchers, LLM developers, Enterprise AI teams

Similar To

AgentBench · GAIA · LiveBench

Post Description

τ-Bench is an open benchmark for evaluating AI agents on grounded, multi-turn customer service tasks with verifiable outcomes. It's been great to see the community adopt it since launch — this is now the third iteration. With τ³-Bench, we're extending it to two new settings: knowledge-intensive retrieval and full-duplex voice.
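
As a rough illustration of what a "verifiable outcome" can mean, the sketch below grades an episode by comparing the final environment state against an expected state instead of trusting the agent's own report. This is not τ-Bench's actual API: `EpisodeResult`, `is_verified_success`, and the order fields are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeResult:
    """Hypothetical record of one finished task episode."""
    agent_claimed_success: bool  # what the agent said at the end of the call
    final_state: dict            # what the environment actually looks like


def is_verified_success(result: EpisodeResult, expected_state: dict) -> bool:
    """Pass only if every expected field holds the expected value.

    The agent's own claim is deliberately ignored.
    """
    return all(
        result.final_state.get(key) == value
        for key, value in expected_state.items()
    )


# Hypothetical task: cancel an order and issue a refund.
expected = {"order_1001.status": "cancelled", "order_1001.refund": 42.50}

hallucinated = EpisodeResult(
    agent_claimed_success=True,
    final_state={"order_1001.status": "pending", "order_1001.refund": 0.0},
)
completed = EpisodeResult(
    agent_claimed_success=True,
    final_state={"order_1001.status": "cancelled", "order_1001.refund": 42.50},
)

print(is_verified_success(hallucinated, expected))
print(is_verified_success(completed, expected))
```

Both episodes end with the agent claiming success, but only the second one changes the state the task required, so only that one passes.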

τ-Knowledge: agents must navigate ~700 interconnected policy documents to complete multi-step tasks. The best frontier model (GPT-5.2, high reasoning) hits ~25%. The surprising part: even when you hand the model the exact documents it needs, performance only reaches ~40%. We found that the bottleneck isn't retrieval; it's reasoning over complex, interlinked policies and executing the right actions in the right order.
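
The arithmetic behind that conclusion can be made explicit. The sketch below uses only the two approximate figures quoted above (~25% with retrieval, ~40% with the exact documents supplied). The function name and the reading of the gap as "failures attributable to retrieval" are assumptions made for illustration, not an analysis taken from the benchmark.

```python
def decompose_failures(baseline_pct: int, oracle_pct: int) -> dict:
    """Split the failure rate using a baseline score and an oracle-document score.

    baseline_pct: success rate when the agent must find documents itself.
    oracle_pct:   success rate when the exact documents are handed to it.
    """
    total_failure = 100 - baseline_pct          # tasks failed at baseline
    retrieval_gain = oracle_pct - baseline_pct  # points recovered by perfect retrieval
    remaining_failure = 100 - oracle_pct        # still failing with perfect retrieval
    return {
        "retrieval_gain_pts": retrieval_gain,
        "remaining_failure_pts": remaining_failure,
        "share_of_failures_fixed_by_retrieval": retrieval_gain / total_failure,
    }


# Approximate figures from the post: ~25% baseline, ~40% with exact documents.
print(decompose_failures(25, 40))
```

Under these assumptions, perfect retrieval recovers 15 of the 75 failing points, about a fifth of the failures, while 60 points remain, which fits the claim that reasoning and action ordering dominate.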

τ-Voice: the same grounded tasks, but over live full-duplex voice with realistic audio (accents, background noise, interruptions, compressed phone lines). Voice agents score 31–51% in clean audio conditions and 26–38% in realistic ones. A consistent failure pattern across providers (OpenAI, Gemini, xAI): the agent mishears a name or email during authentication, and everything downstream fails.

We also incorporated 75+ task fixes to the original airline, retail, and telecom domains — many based on community audits and PRs (including contributions from Amazon and Anthropic). We believe a benchmark is only as good as its maintenance, and we're grateful for the community's help improving it.

Code and leaderboard are open — we'd welcome community submissions and feedback.

Blog post (papers, code, leaderboard): https://sierra.ai/blog/bench-advancing-agent-benchmarking-to...
