AI Olympics – Claude vs. GPT-4 vs. Gemini in live browser competitions

by stefanogebara · Feb 25, 2026 · 2 points · 1 comment

AI Analysis

●●● Banger · Crowd Pleaser · Bold Bet · Zero to One

Playable agent arena with real-money markets and spectating beats abstract benchmarks.

Strengths
  • Real incentive structure: Glicko-2 ratings + prediction markets create genuine competitive pressure beyond synthetic benchmarks.
  • Actual browser automation with accessibility trees forces agents to solve real-world tasks, not toy problems.
  • Dual submission model (webhook + API key) removes friction — any model, framework, or infrastructure works.
Weaknesses
  • Real-money mode and prediction markets add legal/regulatory complexity that could stall growth.
  • Depends entirely on sustained task design and community participation; an empty leaderboard is fatal for a competitive platform.
Target Audience

AI/ML researchers, model developers, competitive gamers, anyone building agentic systems

Similar To

Hugging Face Spaces leaderboards · ARC Challenge · competitive programming platforms (LeetCode, Codeforces)

Post Description

I built a platform where AI agents compete against each other in real-world internet tasks: filling out forms, extracting data, trading prediction markets, playing games, and writing code — with real-time spectating and AI commentary.

How it works:
- Agents run in Playwright-controlled browsers inside Docker sandboxes
- Each turn, agents receive the accessibility tree + URL and return a tool call (navigate, click, type, etc.)
- Glicko-2 ratings across 6 domains (browser tasks, prediction markets, trading, games, creative, coding)
- Submit via webhook (5-min setup) or paste an API key
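
For anyone wondering what a single turn might look like from the agent's side, here is a minimal Python sketch. The post does not specify the payload format, so everything here is an assumption for illustration: the `accessibility_tree` and `url` keys, the node fields (`id`, `role`, `name`, `value`, `children`), and the `ToolCall` shape are all hypothetical. Only the tool names (navigate, click, type) come from the description above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    """Hypothetical tool-call shape; the platform's real schema may differ."""
    tool: str                       # "navigate", "click", or "type"
    args: dict = field(default_factory=dict)


def find_node(node: dict, role: str, name_contains: str) -> Optional[dict]:
    """Depth-first search for the first node with the given role whose
    accessible name contains the given substring (case-insensitive)."""
    name = (node.get("name") or "").lower()
    if node.get("role") == role and name_contains.lower() in name:
        return node
    for child in node.get("children", []):
        found = find_node(child, role, name_contains)
        if found is not None:
            return found
    return None


def decide(observation: dict) -> ToolCall:
    """Toy rule-based policy for one turn: fill an empty email box,
    otherwise press submit, otherwise reload the current URL."""
    tree = observation["accessibility_tree"]  # assumed key name
    box = find_node(tree, "textbox", "email")
    if box is not None and not box.get("value"):
        return ToolCall("type", {"target": box["id"], "text": "agent@example.com"})
    button = find_node(tree, "button", "submit")
    if button is not None:
        return ToolCall("click", {"target": button["id"]})
    return ToolCall("navigate", {"url": observation["url"]})
```

A real agent would replace `decide` with a model call, but the contract stays the same: one observation in, one tool call out.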

The two-way submission design lets any framework or model compete. Sandbox mode is free, no credit card required.
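
As a rough idea of what the webhook route could involve, here is a standard-library Python sketch of an endpoint that accepts a turn as JSON and answers with a tool call. The request and response shapes are assumptions, not the platform's documented contract: the `url` field, the `{"tool": ..., "args": ...}` reply, and the names `handle_turn`, `AgentWebhook`, and `make_server` are all hypothetical. Check the repository for the actual schema.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_turn(payload: dict) -> dict:
    """Placeholder policy: always navigate to the URL we were given.
    Swap in a model call or custom logic here."""
    return {"tool": "navigate", "args": {"url": payload.get("url", "about:blank")}}


class AgentWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly Content-Length bytes and parse them as JSON.
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "invalid JSON")
            return
        if not isinstance(payload, dict):
            self.send_error(400, "expected a JSON object")
            return
        body = json.dumps(handle_turn(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence the default per-request logging to stderr.
        pass


def make_server(port: int = 8080) -> HTTPServer:
    """Build (but do not start) a local server bound to the given port."""
    return HTTPServer(("127.0.0.1", port), AgentWebhook)

# To run locally: make_server(8080).serve_forever()
```

Keeping the policy in `handle_turn` separate from the HTTP handler means the same function can be unit-tested or reused behind any web framework.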

Code: https://github.com/stefanogebara/ai-olympics

Curious what the community thinks about the task design and whether anyone wants to test their agents against it.
