RewardHackWatch – Reward hacking detector for LLM agents
Catches LLM reward hacking at runtime when models game evals.
Dataset of hackable TerminalBench-style tasks and exploit trajectories
Exposes 331 hackable agent benchmarks with real exploit trajectories.
AI researchers, RL engineers, benchmark developers
Terminal-Bench · AgentBench · SWE-bench
A forkable ~4k-line Deno/TypeScript codebase that keeps extension deliberately small: one file to add a tool, two functions to add an LLM provider. Inline approval for dangerous actions (shell, screenshots), multi-provider support, and small, readable modules make this a practical, Telegram-centric toolkit. It does not reinvent agent paradigms, but it is an excellent starting point if you want something you can bend and deploy fast.
Trajectory tracking beats diffs for agent memory, but terminal recording isn't new.
First public archive of real agent trajectories; nothing like this existed before.
Community jailbreaks with published exploits, but Lakera and Gandalf already cover AI red-teaming.
Proves you have the exploit without leaking it, using SP1 zkVM and Drand tlock.