GitHub Repository

Adversarial multi-turn benchmark for LLM debate quality, using side-swapped matchups and multi-model judging to rank models by judged debate performance.

20 stars

LLM Debate Benchmark

by zone411 · Mar 23, 2026 · 9 points · 3 comments

AI Analysis

●● Solid · Big Brain · Dark Horse

Side-swapped debate matchups expose model weaknesses standard benchmarks miss.

Strengths
  • Side-swapped matchups control for topic bias in debate evaluation.
  • Bradley-Terry ratings fitted over 1,162 completed debates show rigorous methodology.
  • Tests knowledge under adversarial pressure, not just static question answering.
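
To make the first two strengths concrete, here is a minimal Python sketch of how side-swapped debate results could be turned into Bradley-Terry ratings. It is an illustration under stated assumptions, not the repository's actual code: the model names, the `(pro, con, winning_side)` record format, and the helper names `debates_to_wins` and `bradley_terry` are all hypothetical, and the fit uses the standard minorization-maximization update, which may differ from whatever procedure the benchmark uses.

```python
from collections import defaultdict

# Hypothetical records: (pro_model, con_model, winning_side).
# Each pair debates twice with sides swapped, so neither model
# benefits from always arguing the easier side of a topic.
debates = [
    ("model_a", "model_b", "pro"), ("model_b", "model_a", "con"),  # a wins both
    ("model_b", "model_c", "pro"), ("model_c", "model_b", "con"),  # b wins both
    ("model_a", "model_c", "pro"), ("model_c", "model_a", "pro"),  # split 1-1
]

def debates_to_wins(records):
    """Count wins[(winner, loser)] regardless of which side was argued."""
    wins = defaultdict(int)
    for pro, con, side in records:
        winner, loser = (pro, con) if side == "pro" else (con, pro)
        wins[(winner, loser)] += 1
    return wins

def bradley_terry(wins, iters=500, tol=1e-10):
    """Fit Bradley-Terry strengths with the classic MM update."""
    models = sorted({m for pair in wins for m in pair})
    total_wins = defaultdict(float)
    games = defaultdict(float)  # symmetric count of debates per pair
    for (w, l), n in wins.items():
        total_wins[w] += n
        games[(w, l)] += n
        games[(l, w)] += n
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            denom = sum(
                games.get((i, j), 0.0) / (strength[i] + strength[j])
                for j in models if j != i
            )
            new[i] = total_wins[i] / denom if denom else strength[i]
        # Strengths are only defined up to scale; fix the mean at 1.
        scale = len(models) / sum(new.values())
        new = {m: s * scale for m, s in new.items()}
        converged = max(abs(new[m] - strength[m]) for m in models) < tol
        strength = new
        if converged:
            break
    return strength

ratings = bradley_terry(debates_to_wins(debates))
for model, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.3f}")
```

Under this model, the estimated probability that model i beats model j is strength_i / (strength_i + strength_j). A model with no wins at all would be driven to a strength of zero, so a real pipeline would likely need regularization or a prior.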
Weaknesses
  • LLM evaluation space is crowded with Arena, HELM, and other benchmarks.
  • Debate format is clever but may not correlate with practical use cases.
Category
Target Audience

LLM researchers and AI developers evaluating model capabilities

Similar To

LMSYS Chatbot Arena · HELM · BIG-bench

Similar Projects