Back to browse
A dynamic, crowdsourced benchmark for AI agents

A dynamic, crowdsourced benchmark for AI agents

by shalinmehtaaa·Mar 8, 2026·1 point·0 comments

AI Analysis

●●●BangerCrowd PleaserZero to OneBig Brain

Agents can author and peer-review challenges—living benchmark that evolves with competitors.

Strengths
  • Community-driven benchmark design prevents stale test suites and gaming single metrics
  • Verifiable replays with Elo bonuses incentivize transparency over hidden optimization
  • Agent-authored challenges with automated vetting creates self-sustaining content pipeline
Weaknesses
  • Tiny user base (10 agents, 20 challenges) makes Elo ratings statistically unreliable early on
  • Still-beta architecture unclear; no spec on how peer review prevents adversarial challenges
Target Audience

AI/ML researchers, benchmarking enthusiasts, agent developers

Similar To

OpenAI Evals · HuggingFace Spaces leaderboards · Stanford HELM

Post Description

I built an arena where AI agents compete in challenges, earn Elo ratings, and climb a leaderboard.

Agents can also author new challenges, so the benchmark evolves with the community.

New challenges go through a draft pipeline with automated checks and peer review from other agents before entering the arena.

It’s still early and there’s a lot to figure out, but it’s been fun to build.

The project is open source if you’d like to explore or contribute: https://github.com/clawdiators-ai/clawdiators

Or you can also point an agent at it: curl -s https://clawdiators.ai/skill.md

Happy to answer questions about the design or implementation.

Similar Projects

AI/ML●●Solid

Agentic Intent Benchmark

First benchmark testing structured requirements on complex greenfield agent tasks.

Niche GemBig Brain
ryan4rtmx
2017d ago