GitHub Repository

Kitchen Rush: a benchmark for accurate AND fast native tool calling

0 stars · Python

Kitchen Rush, an Overcooked-inspired LLM tool-calling benchmark

by bombastic311·Jun 16, 2026·2 points·0 comments

AI Analysis

●● Solid · Big Brain · Niche Gem

Thinking time becomes lost game time: the benchmark measures latency alongside accuracy.

Strengths
  • Latency-as-game-mechanic is genuinely novel compared to BFCL and ToolSandbox.
  • Fully deterministic runs with browser replay viewer for auditing model behavior.
  • Zero core dependencies, and a single KR score makes model comparison straightforward.
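
To make the latency-as-game-mechanic idea concrete, here is a minimal sketch of one way a benchmark could charge response latency against a fixed game clock. It is not taken from the Kitchen Rush code: the `GameClock` class, the `actions_completed` helper, the 60-tick budget, and the 0.5-second tick length are all hypothetical, chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class GameClock:
    """Hypothetical game clock: model latency is paid for in game ticks."""
    total_ticks: int      # assumed length of one game round
    tick_seconds: float   # assumed real-time seconds represented by one tick
    elapsed: int = 0

    def cost(self, latency_seconds: float) -> int:
        """Ticks consumed by one tool call; every call costs at least one tick."""
        return max(1, math.ceil(latency_seconds / self.tick_seconds))

    @property
    def remaining(self) -> int:
        return self.total_ticks - self.elapsed

    def advance(self, ticks: int) -> None:
        self.elapsed += ticks


def actions_completed(latencies, total_ticks=60, tick_seconds=0.5):
    """Count the tool calls that finish before the round ends."""
    clock = GameClock(total_ticks, tick_seconds)
    done = 0
    for latency in latencies:
        ticks = clock.cost(latency)
        if ticks > clock.remaining:
            break  # the call would finish after the round is over
        clock.advance(ticks)
        done += 1
    return done


if __name__ == "__main__":
    fast = actions_completed([0.5] * 100)  # one tick per call
    slow = actions_completed([2.0] * 100)  # four ticks per call
    print(f"fast model: {fast} actions, slow model: {slow} actions")
```

Because the latencies are passed in as a recorded trace rather than measured live, the same trace always yields the same count, which is one plausible way to keep such runs deterministic.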
Weaknesses
  • Benchmark only: it measures agents but does not improve them.
  • Zero stars on GitHub suggests very early adoption and unproven utility.
Category
Target Audience

ML engineers evaluating real-time agent models

Similar To

BFCL · ToolSandbox · τ-bench

Similar Projects

AI/ML · ●● Solid

InferBench – Benchmark local LLM engines with one click

One-click LLM benchmarking with real tok/s metrics, where llama.cpp requires manual setup.

Ship It · Solve My Problem
JoniMartin
2010d ago