Kitchen Rush, Overcooked inspired LLM tool calling benchmark

by bombastic311 · Jun 16, 2026 · 2 points · 0 comments

AI Analysis

●● Solid · Big Brain · Niche Gem

Thinking time becomes lost game time: the benchmark measures latency alongside accuracy.

Strengths

• Latency-as-game-mechanic is genuinely novel compared to BFCL and ToolSandbox.
• Fully deterministic runs, with a browser replay viewer for auditing model behavior.
• Zero core dependencies, and a single KR score makes model comparison straightforward.
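
To make the latency-as-game-mechanic idea concrete, here is a minimal Python sketch of how thinking time could be charged against a game clock. It is not Kitchen Rush's actual implementation: the `Action` type, the `run_episode` function, the fixed `action_cost_s`, and the scoring rule are all hypothetical, and the latencies are simulated values rather than wall-clock measurements so the run stays deterministic.

```python
from dataclasses import dataclass


@dataclass
class Action:
    tool: str         # hypothetical tool name, e.g. "chop"
    latency_s: float  # simulated time the model spent deciding on this call
    reward: int       # points earned if the action completes in time


def run_episode(actions, time_limit_s, action_cost_s=1.0):
    """Charge each tool call's latency against a fixed game clock.

    Assumption: every action also costs a fixed amount of in-game time,
    and an action only scores if it finishes before the clock runs out.
    """
    clock = 0.0
    score = 0
    for action in actions:
        clock += action.latency_s + action_cost_s
        if clock > time_limit_s:
            break  # out of game time: remaining actions never happen
        score += action.reward
    return score


# Two hypothetical models issuing identical, equally correct tool calls,
# differing only in how long they think before each one.
fast = [Action("chop", 0.5, 10) for _ in range(6)]
slow = [Action("chop", 4.0, 10) for _ in range(6)]

fast_score = run_episode(fast, time_limit_s=10.0)
slow_score = run_episode(slow, time_limit_s=10.0)
print(fast_score, slow_score)
```

Under these assumptions both models choose the same actions, but the slower one completes fewer of them before time runs out, so a single score ends up reflecting latency as well as accuracy.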

Weaknesses

Data · ●●● Banger

7,560 runs proving cheaper models beat expensive ones on production OCR tasks.

Big Brain · Solve My Problem

TimoKerr

511mo ago

AI/ML · ●●● Banger

Opposite-narrator test catches models agreeing with both sides of same dispute.

Big Brain · Dark Horse

zone411

303mo ago

AI/ML · ●● Solid

51 models, 1613 runs, $558 spent — finally proofreading benchmarks with real numbers.

Niche Gem · Big Brain

artursapek

302mo ago

AI/ML · ●● Solid

One-click LLM benchmarking with real tok/s metrics when llama.cpp requires manual setup.

Ship It · Solve My Problem

JoniMartin

2010d ago

AI/ML · ●● Solid

Agent loop proofreading evals where HELM and LMSys are too generic.

Solve My Problem · Ship It

artursapek

321mo ago

AI/ML · ●● Solid

One-command benchmark suite comparing Ollama and XGBoost performance with a shared Streamlit dashboard.

Solve My Problem · Niche Gem

albedan

201mo ago