GPTFortress, a 24/7 livestream of GPT-5 playing Dwarf Fortress
Fun novelty stream, but no code or architecture to evaluate.

You can watch an LLM play NetHack step by step, with the model's reasoning, the exact action code, and a live game canvas all visible; that instrumentation is the product's real selling point. The leaderboard and run/benchmark framing makes it useful for comparing agents rather than just a flashy demo, but it is still squarely for people who care about NetHack or agent evaluation. More detail on reproducible metrics and integrations would push it further.
Reinforcement learning researchers, game-AI developers, ML hobbyists and demo-streamers
263k config search space benchmarked across robot fleets—nothing like this exists for robotics AI.
Chat-controlled robot arm livestream feels like Twitch Plays Robotics evolved.
Wealth-based scoring reveals strategic failures that survival-only benchmarks miss.
Deterministic agent benchmarking with strict validation—unlike SWE-Bench, measures whether agents actually operate.
Using 1980s Rogue as an LLM benchmark is genuinely novel and technically clever.