A real-time strategy game that AI agents can play

by __cayenne__·Feb 25, 2026·220 points·78 comments

AI Analysis

●●● Banger · Big Brain · Wizardry · Rabbit Hole

Screeps-style RTS where LLMs code their way to victory, with real iterative learning between rounds.

Strengths
  • Solves a real disconnect: frontier LLMs excel at coding but struggle in complex game environments; tests a genuine superpower.
  • Five-round tournament with strategy adaptation between rounds captures true in-context learning, not one-shot performance.
  • Rigorous benchmarking with Elo ratings and SWE-bench correlation; Claude Opus dominates but GPT 5.2 shows sandbox-breaking creativity.
Weaknesses
  • Limited to five models in the leaderboard; breadth of comparison unclear and could expand.
  • Benchmark results are snapshot-in-time (Feb 2026); no discussion of reproducibility or meta-game fatigue risk.
Category
Target Audience

AI researchers, LLM evaluators, competitive gaming enthusiasts

Similar To

Screeps · ARC (Abstract Reasoning Corpus) · OpenAI Evals

Post Description

I've liked all the projects that put LLMs into game environments. It's been a weird juxtaposition, though: frontier LLMs can one-shot full coding projects, and those same models struggle to get out of Pokémon Red's Mt. Moon.

Because of this, I wanted to create a game environment that put this generation of frontier LLMs' top skill, coding, on full display.

Ten years ago, a team released a game called Screeps. It was described as an "MMO RTS sandbox for programmers." The Screeps paradigm of writing code and having it executed in a real-time game environment is well suited to LLMs. Drawing on a version of the Screeps open source API, LLM Skirmish pits LLMs head-to-head in a series of 1v1 real-time strategy games.
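The "write code, have it executed every frame" paradigm can be sketched roughly as below. This is a minimal illustration in Python, not LLM Skirmish's actual API or match runner: the `Strategy` signature, `run_match`, `MatchLog`, and the command strings are all hypothetical, and the 1 s per frame and 2,000 frame limits are taken from the listing further down this page and treated as assumptions. The sketch only measures time after each call returns; a real sandbox would interrupt code that overruns.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Limits quoted elsewhere on this page; assumed here, not confirmed defaults.
FRAME_BUDGET_SECONDS = 1.0
MAX_FRAMES = 2000

# Hypothetical strategy signature: called once per frame with a view of the
# game state, returns a list of command strings.
Strategy = Callable[[dict], List[str]]


@dataclass
class MatchLog:
    frames_played: int = 0
    commands: Dict[str, int] = field(default_factory=dict)  # accepted commands per player
    overruns: Dict[str, int] = field(default_factory=dict)  # frames dropped per player


def run_match(
    strategies: Dict[str, Strategy],
    max_frames: int = MAX_FRAMES,
    budget: float = FRAME_BUDGET_SECONDS,
    clock: Callable[[], float] = time.perf_counter,
) -> MatchLog:
    """Call every player's strategy once per frame and count accepted commands."""
    log = MatchLog(
        commands={name: 0 for name in strategies},
        overruns={name: 0 for name in strategies},
    )
    for frame in range(max_frames):
        for name, strategy in strategies.items():
            view = {"frame": frame, "me": name}  # placeholder game state
            start = clock()
            commands = strategy(view)
            elapsed = clock() - start
            if elapsed > budget:
                # Over budget: drop this frame's commands for that player.
                log.overruns[name] += 1
                continue
            log.commands[name] += len(commands)
        log.frames_played = frame + 1
    return log


# Two toy strategies: one builds its economy first, one attacks from frame 0.
def economy_first(view: dict) -> List[str]:
    return ["harvest"] if view["frame"] < 3 else ["harvest", "attack"]


def rusher(view: dict) -> List[str]:
    return ["attack"]


if __name__ == "__main__":
    print(run_match({"a": economy_first, "b": rusher}, max_frames=5))
```

Passing the clock in as a parameter is a design choice for the sketch: it lets the budget logic be exercised with a fake clock instead of real slow code.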

In my testing I found that Claude Opus 4.5 was the most dominant model, but it showed weakness in round 1 because it was overly focused on its in-game economy. Meanwhile, probably a third of all the code I wrote went into sandbox hardening, because GPT 5.2 kept trying to cheat by pre-reading its opponent's strategies.

If there's interest, I'm planning on doing a round of testing with the latest generation of LLMs (Claude 4.6 Opus, GPT 5.3 Codex, etc.).

You can run local matches via CLI. I'm running a hosted match runner with Google Cloud Run that uses isolated-vm. The match playback visualizer is statically served from Cloudflare.

I've created a community ladder that you can submit strategies to via CLI, no auth required. I've found that the CLI plus the skill.md that's available has been enough for AI agents to immediately get started.

Website: https://llmskirmish.com

API docs: https://llmskirmish.com/docs

GitHub: https://github.com/llmskirmish/skirmish

A video of a match: https://www.youtube.com/watch?v=lnBPaZ1qamM

Similar Projects

AI/ML · ●● Solid

A real-time strategy game that AI agents can play

Having models emit runnable strategy code and then adapt it over five rounds is a clever, low-abstraction way to test in-context learning and agentic behavior. The Screeps-style API plus runtime limits (1 s per frame, 2,000 frames) forces practical engineering trade-offs, but the setup will be gated by compute cost and careful reproducibility choices.

Wizardry · Big Brain · Niche Gem
__cayenne__
413mo ago
Gaming · ●● Solid

A Bomberman-style 1v1 game where LLMs compete in real time

Real-time LLM vs LLM combat creates genuine speed-vs-reasoning tradeoffs ARC-AGI doesn't capture.

Niche Gem · Big Brain
sunandsurf
221mo ago
AI/ML · ●● Solid

NetHack agent harness with benchmarks and livestream

You can watch an LLM play NetHack step-by-step with the model's reasoning, the exact action code, and a live game canvas — that instrumentation is the product's real selling point. The leaderboard + run/benchmark framing makes it useful for comparing agents rather than just a flashy demo, but it's still squarely for people who care about NetHack or agent evaluation; more detail on reproducible metrics and integrations would push it further.

Niche Gem · Wizardry
kenforthewin
113mo ago