Terminal-Wrench, a dataset of 331 realistic hackable environments

by neversupervised·Apr 15, 2026·6 points·2 comments

AI Analysis

●●● Banger · Dark Horse · Big Brain

Exposes 331 hackable agent benchmark environments with real exploit trajectories.

Strengths

• Real Terminal Bench tasks preserved with original definitions and exploit paths.
• Includes sanitized trajectories specifically for monitorability and detection research.
• Reveals systemic gaps in the billion-dollar agent evaluation market.

Weaknesses

• Niche utility limited to RL and agent eval researchers.
• Dataset requires significant research context to interpret and utilize correctly.

Post Description

I want to share a new dataset of 331 reward-hackable environments. These are real environments used in Terminal Bench and adjacent benchmarks. I first got interested in this because, as a reviewer for Terminal Bench, I noticed that many of our tasks were hackable. I also noticed that many contributors submit tasks to the benchmark because doing so lends credibility when selling environments to labs. Hence, TBench tasks are, in my opinion, held to a higher quality standard than the environments being used today for RL. No one is spending hours manually reviewing the $1B in tasks being purchased by major labs. As far as I understand, while everyone knows environments are hackable, nobody has released hundreds of "realistic" hackable environments before.

Similar Projects

AI/ML · ●●● Banger

RewardHackWatch – Reward hacking detector for LLM agents

Catches LLM reward hacking at runtime when models game evals.

Big Brain · Wizardry · Ship It

aerosta

11 · 5mo ago

AI/ML · ●● Solid

Hackable Skinny Clawdbot for Telegram

A forkable ~4k-line Deno/TypeScript codebase that keeps adding a tool or an LLM backend deliberately small: one file for a tool, two functions for a provider. Inline approval for dangerous actions (shell, screenshots), multi-provider support, and a focus on small, readable modules make this a practical, Telegram-centric toolkit. It's not reinventing agent paradigms, but it's an excellent starting point if you want something you can actually bend and deploy fast.

Niche Gem · Ship It

vseplet

10 · 5mo ago

Developer Tools · ●● Solid