
A multi-agent benchmark where eight LLMs play a money-driven elimination game with private transfers and a buyout endgame, and are ranked by final wealth

16 stars

Buyout Game Benchmark: Multi-Agent Bargaining, Transfers, and Takeovers

by zone411·Mar 30, 2026·6 points·0 comments

AI Analysis

●● Solid · Big Brain · Niche Gem

Wealth-based scoring reveals strategic failures that survival-only benchmarks miss.

Strengths
  • Private transfers plus public votes create genuine coalition pricing dynamics.
  • Bradley-Terry ranking over mirrored match packs ensures statistical rigor.
  • Final wealth vs finish order exposes models that survive but overspend.
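
The Bradley-Terry ranking mentioned above can be illustrated with a minimal sketch. It assumes each game is reduced to pairwise outcomes by comparing final wealth, which may not be how the benchmark itself builds its comparisons. The function names and the sample wealth figures are hypothetical, and the fit uses the standard minorization-maximization update for the Bradley-Terry model.

```python
def pairwise_wins(games, n_players):
    """Count pairwise wins, treating higher final wealth as a win.

    Assumption for illustration: games is a list of per-game wealth lists,
    indexed by player. The benchmark's real pairing rule may differ.
    """
    wins = [[0] * n_players for _ in range(n_players)]
    for wealth in games:
        for i in range(n_players):
            for j in range(n_players):
                if i != j and wealth[i] > wealth[j]:
                    wins[i][j] += 1
    return wins


def bradley_terry(wins, iters=200, tol=1e-10):
    """Fit Bradley-Terry strengths, where wins[i][j] = times i beat j."""
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pairing contributes its game count over the summed strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]  # normalize so strengths sum to 1
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


# Made-up final wealth for three players across three games.
games = [[120, 80, 40], [90, 110, 30], [100, 60, 70]]
wins = pairwise_wins(games, 3)
strengths = bradley_terry(wins)
ranking = sorted(range(3), key=lambda i: strengths[i], reverse=True)
```

Here `ranking` lists player indices from strongest to weakest. A player with no wins at all would be driven toward zero strength, so a real pipeline would likely need a prior or regularization.
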
Weaknesses
  • Niche audience: LLM benchmarking is crowded with established frameworks.
  • No API or programmatic access to run custom model evaluations yet.
Target Audience

AI researchers, LLM eval engineers, multi-agent systems developers

Similar To

LangChain Evaluators · HELM · AgentBench

Similar Projects

AI/ML · ●● Solid

NetHack agent harness with benchmarks and livestream

You can watch an LLM play NetHack step by step, with the model's reasoning, the exact action code, and a live game canvas; that instrumentation is the product's real selling point. The leaderboard and run/benchmark framing make it useful for comparing agents rather than just a flashy demo, but it is still squarely for people who care about NetHack or agent evaluation. More detail on reproducible metrics and integrations would push it further.

Niche Gem · Wizardry
kenforthewin
113mo ago
AI/ML · ●● Solid

A real-time strategy game that AI agents can play

Having models emit runnable strategy code and then observing five rounds of iterative adaptation is a clever, low-abstraction way to test in-context learning and agentic behavior. The Screeps-style API plus per-frame runtime limits (1s/frame, 2,000 frames) forces practical engineering trade-offs, but the setup will be gated by compute cost and will need careful reproducibility choices.

Wizardry · Big Brain · Niche Gem
__cayenne__
413mo ago