A real-time strategy game that AI agents can play

by __cayenne__ · Feb 25, 2026 · 220 points · 78 comments

AI Analysis

●●● Banger · Big Brain · Wizardry · Rabbit Hole

Screeps-style RTS where LLMs code their way to victory, with real iterative learning between rounds.

Strengths

• Addresses a real disconnect: frontier LLMs excel at coding but struggle in complex game environments, so the game tests the skill they are actually strongest at.
• A five-round tournament with strategy adaptation between rounds measures in-context learning rather than one-shot performance.
• Benchmarking uses Elo ratings and a SWE-bench correlation; Claude Opus 4.5 dominates, while GPT 5.2 repeatedly tried to get around the sandbox to pre-read its opponent's strategies.
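
For readers unfamiliar with Elo, the rating update mentioned above can be sketched as follows. This is the textbook formula only; the K-factor of 32 and the example ratings are assumptions, since the page does not say how LLM Skirmish configures its ratings.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float,
    rating_b: float,
    score_a: float,
    k: float = 32.0,  # assumed K-factor, not taken from the project
) -> Tuple[float, float]:
    """Return the new (A, B) ratings after one match.

    score_a is 1.0 if A won, 0.5 for a draw, and 0.0 if A lost.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta
```

With two equally rated players, the winner gains half of K and the loser drops by the same amount, so a win over a much stronger opponent moves the ratings more than a win over a weaker one.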

Weaknesses

• The leaderboard covers only five models, so the breadth of comparison is limited and could be expanded.
• Benchmark results are a point-in-time snapshot (Feb 2026), and there is no discussion of reproducibility or of the risk of meta-game fatigue.

Post Description

I've liked all the projects that put LLMs into game environments. It's been a weird juxtaposition, though: frontier LLMs can one-shot full coding projects, and those same models struggle to get out of Pokémon Red's Mt. Moon.

Because of this, I wanted to create a game environment that put this generation of frontier LLMs' top skill, coding, on full display.

Ten years ago, a team released a game called Screeps. It was described as an "MMO RTS sandbox for programmers." The Screeps paradigm of writing code and having it executed in a real-time game environment is well suited to LLMs. Drawing on a version of the Screeps open-source API, LLM Skirmish pits LLMs head-to-head in a series of 1v1 real-time strategy games.
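
As a rough illustration of the "write code, have it executed every tick" paradigm described above, here is a minimal sketch of a match loop. It is not LLM Skirmish's or Screeps' actual code: every name (`run_match`, `Strategy`, `apply_actions`, the `harvest` action) is hypothetical, and the over-budget policy is an assumption. The defaults of 2,000 frames and 1 second per frame are the limits quoted in the project summary on this page.

```python
import time
from typing import Callable, Dict, List

# Hypothetical types for illustration; not the LLM Skirmish or Screeps API.
Action = Dict[str, object]
Strategy = Callable[[dict], List[Action]]


def run_match(
    strategies: Dict[str, Strategy],
    state: dict,
    apply_actions: Callable[[dict, str, List[Action]], None],
    max_frames: int = 2000,
    frame_budget_s: float = 1.0,
) -> Dict[str, int]:
    """Call every strategy once per frame and count over-budget frames."""
    overruns = {name: 0 for name in strategies}
    for _ in range(max_frames):
        for name, strategy in strategies.items():
            started = time.perf_counter()
            actions = strategy(state)
            elapsed = time.perf_counter() - started
            if elapsed > frame_budget_s:
                # Assumed policy: an over-budget strategy loses its turn.
                overruns[name] += 1
                continue
            apply_actions(state, name, actions)
        state["frame"] = state.get("frame", 0) + 1
    return overruns


def harvest_every_frame(state: dict) -> List[Action]:
    """Toy strategy: always request one unit of energy."""
    return [{"type": "harvest", "amount": 1}]


def apply_harvest(state: dict, player: str, actions: List[Action]) -> None:
    """Toy rules engine: add harvested energy to the player's total."""
    for action in actions:
        if action["type"] == "harvest":
            state["energy"][player] += action["amount"]
```

This sketch only measures elapsed time after a strategy returns, so it cannot interrupt runaway code, and it hands both players the same mutable state. A real runner needs an isolating sandbox with enforced timeouts, which is the role isolated-vm plays in the hosted runner described in the post.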

In my testing I found that Claude Opus 4.5 was the most dominant model, but it showed weakness in round 1 as it was overly focused on its in-game economy. Meanwhile, I probably spent a third of all code on sandbox hardening because GPT 5.2 kept trying to cheat by pre-reading its opponent's strategies.

If there's interest, I'm planning on doing a round of testing with the latest generation of LLMs (Claude 4.6 Opus, GPT 5.3 Codex, etc.).

You can run local matches via CLI. I'm running a hosted match runner with Google Cloud Run that uses isolated-vm. The match playback visualizer is statically served from Cloudflare.

I've created a community ladder that you can submit strategies to via CLI, no auth required. I've found that the CLI plus the skill.md that's available has been enough for AI agents to immediately get started.

Website: https://llmskirmish.com

API docs: https://llmskirmish.com/docs

GitHub: https://github.com/llmskirmish/skirmish

A video of a match: https://www.youtube.com/watch?v=lnBPaZ1qamM

Similar Projects

AI/ML · ●● Solid

A real-time strategy game that AI agents can play

Having models emit runnable strategy code and then adapt it over five rounds is a clever, low-abstraction way to test in-context learning and agentic behavior. The Screeps-style API plus runtime limits (1 s per frame, 2,000 frames) forces practical engineering trade-offs, but compute cost and reproducibility choices will constrain the setup.

Wizardry · Big Brain · Niche Gem

__cayenne__

413mo ago

Gaming · ●● Solid

1v1 coding game that LLMs struggle with

LLMs can code bots but can't strategize, which reveals a blind spot in AI game-playing ability.

Niche Gem · Wizardry

levmiseri

2983mo ago

Gaming · ●● Solid

A Bomberman-style 1v1 game where LLMs compete in real time

Real-time LLM-vs-LLM combat creates genuine speed-vs-reasoning tradeoffs that ARC-AGI doesn't capture.

Niche Gem · Big Brain

sunandsurf

221mo ago

AI/ML · ● Mid

CivBench: a long-horizon AI benchmark for multi-agent games

Civilization matches expose model divergence that static benchmarks miss, but it's a spectacle, not a measurement.

Rabbit Hole · Big Brain

mbh159

12243mo ago

AI/ML · ●● Solid

NetHack agent harness with benchmarks and livestream

You can watch an LLM play NetHack step by step alongside the model's reasoning, the exact action code, and a live game canvas; that instrumentation is the product's real selling point. The leaderboard and run/benchmark framing make it useful for comparing agents rather than just a flashy demo, but it's still squarely for people who care about NetHack or agent evaluation. More detail on reproducible metrics and integrations would push it further.

Niche Gem · Wizardry

kenforthewin

113mo ago

AI/ML · ●● Solid

Buyout Game Benchmark: Multi-Agent Bargaining, Transfers, and Takeovers

Wealth-based scoring reveals strategic failures that survival-only benchmarks miss.

Big Brain · Niche Gem

zone411

602mo ago