GitHub Repository

Dataset of hackable TerminalBench-style tasks and exploit trajectories

27 stars · Python

Terminal-Wrench, a dataset of 331 realistic hackable environments

by neversupervised·Apr 15, 2026·6 points·2 comments

AI Analysis

●●● Banger · Dark Horse · Big Brain

Exposes 331 hackable agent benchmark environments with real exploit trajectories.

Strengths
  • Real Terminal Bench tasks preserved with original definitions and exploit paths.
  • Includes sanitized trajectories specifically for monitorability and detection research.
  • Reveals systemic gaps in the billion-dollar agent evaluation market.
Weaknesses
  • Niche utility limited to RL and agent eval researchers.
  • Dataset requires significant research context to interpret and utilize correctly.
Category
Target Audience

AI researchers, RL engineers, Benchmark developers

Similar To

Terminal Bench · AgentBench · SWE-bench

Post Description

I want to share a new dataset of 331 reward-hackable environments. These are real environments used in Terminal Bench and adjacent benchmarks. I first got interested in this because, as a reviewer of Terminal Bench, I noticed a lot of our tasks were hackable. I also noticed that many people contribute to the benchmark because it lends them credibility when selling environments to labs. Hence, TBench tasks are, in my opinion, held to a higher quality standard than the environments being used today for RL. No one is spending hours manually reviewing the $1B in tasks being purchased by major labs. As far as I understand, while everyone knows environments are hackable, nobody has previously released hundreds of "realistic" hackable environments.

Similar Projects

AI/ML · ●● Solid

Hackable Skinny Clawdbot for Telegram

A forkable ~4k-line Deno/TypeScript codebase that keeps adding a tool or an LLM backend deliberately small: one file for a tool, two functions for a provider. Inline approval for dangerous actions (shell, screenshots), multi-provider support, and a focus on small, readable modules make this a practical, Telegram-centric toolkit; it's not reinventing agent paradigms, but it's an excellent starting point if you want something you can actually bend and deploy fast.

Niche Gem · Ship It
vseplet
103mo ago