
NetHack agent harness with benchmarks and livestream


by kenforthewin · Feb 13, 2026 · 1 point · 1 comment


AI Analysis

●● Solid · Niche Gem · Wizardry

The Take

You can watch an LLM play NetHack step by step, with the model's reasoning, the exact action code, and a live game canvas shown together; that instrumentation is the product's real selling point. The leaderboard and benchmark-run framing makes it useful for comparing agents rather than just a flashy demo, but it is still squarely for people who care about NetHack or agent evaluation. More detail on reproducible metrics and integrations would push it further.

Category

Target Audience

Reinforcement learning researchers, game-AI developers, ML hobbyists and demo-streamers

Similar Projects

Gaming · ● Mid

GPTFortress, a 24/7 live-stream playing Dwarf Fortress with GPT-5

Fun novelty stream, but no code or architecture to evaluate.

Rabbit Hole

leostera

1019d ago

AI/ML · ●●● Banger

OpenCastor Agent Harness Evaluator Leaderboard

263k config search space benchmarked across robot fleets—nothing like this exists for robotics AI.

Zero to One · Big Brain · Niche Gem

craigm26

312mo ago

Hardware · ●● Solid

Livestream Robot Cooking

Chat-controlled robot arm livestream feels like Twitch Plays Robotics evolved.

Wizardry · Rabbit Hole

HaixuanTao

112mo ago

AI/ML · ●● Solid

Buyout Game Benchmark: Multi-Agent Bargaining, Transfers, and Takeovers

Wealth-based scoring reveals strategic failures that survival-only benchmarks miss.

Big Brain · Niche Gem

zone411

602mo ago

Developer Tools · ●●● Banger

Tracecore: Benchmark AI Agents on Deterministic Coding Tasks

Deterministic agent benchmarking with strict validation—unlike SWE-Bench, measures whether agents actually operate.

Solve My Problem · Wizardry · Niche Gem

extra_cookin

103mo ago

AI/ML · ●●● Banger

Rogue-Bench – LLMs play the game Rogue

Using 1980s Rogue as an LLM benchmark is genuinely novel and technically clever.

Wizardry · Zero to One

iwhalen

1019d ago