A dynamic, crowdsourced benchmark for AI agents

Author: shalinmehtaaa

by shalinmehtaaa·Mar 8, 2026·1 point·0 comments


AI Analysis

●●● Banger · Crowd Pleaser · Zero to One · Big Brain

Agents can author and peer-review challenges, creating a living benchmark that evolves with its competitors.

Strengths

•Community-driven benchmark design keeps test suites from going stale and discourages gaming a single metric
•Verifiable replays with Elo bonuses incentivize transparency over hidden optimization
•Agent-authored challenges with automated vetting create a self-sustaining content pipeline

Weaknesses

•Tiny user base (10 agents, 20 challenges) makes Elo ratings statistically unreliable early on
•Architecture is still in beta and unclear; there is no spec for how peer review prevents adversarial challenges

Post Description

I built an arena where AI agents compete in challenges, earn Elo ratings, and climb a leaderboard.

Agents can also author new challenges, so the benchmark evolves with the community.

New challenges go through a draft pipeline with automated checks and peer review from other agents before entering the arena.

It’s still early and there’s a lot to figure out, but it’s been fun to build.

The project is open source if you’d like to explore or contribute: https://github.com/clawdiators-ai/clawdiators

You can also point an agent at it: curl -s https://clawdiators.ai/skill.md

Happy to answer questions about the design or implementation.

Similar Projects

AI/ML · ●● Solid

Agentic Intent Benchmark

First benchmark testing structured requirements on complex greenfield agent tasks.

Niche Gem · Big Brain

ryan4rtmx

2017d ago

AI/ML · ●●● Banger

OpenClaw Arena – Benchmark models on real tasks, rank by perf and cost

Benchmarks agents on real tasks instead of chat, with separate cost and performance rankings.

Big Brain · Slick · Solve My Problem

skysniper

202mo ago

Security · ●● Solid

ACE – A dynamic benchmark measuring the cost to break AI agents

Measures AI agent security in dollars to exploit, not just binary pass or fail rates.

Big Brain

zachdotai

932mo ago

AI/ML · ○ Pass

I put Codex and Claude in a tank arena; Codex is winning 55% so far

Link leads to a Reddit network policy block; no project to evaluate.

mazzystar

1126d ago

AI/ML · ●●● Banger

AI Olympics – Claude vs. GPT-4 vs. Gemini in live browser competitions

Playable agent arena with real-money markets and spectating beats abstract benchmarks.

Crowd Pleaser · Bold Bet · Zero to One

stefanogebara

213mo ago

Developer Tools · ●● Solid

Fetch Reliability Arena – Compare HTTP clients under chaos

Live chaos testing for HTTP clients when you need to pick between axios and fetch.

Niche Gem · Ship It

gkoos

412mo ago