New benchmark from the SWE-bench team is 0% solved
Agents fail completely at rebuilding binaries from scratch without source code.
retrospec reverse-engineers plausible high-level spec prompts from Git commits using iterative Copilot SDK agent loops and similarity/realism scoring.
It doesn't guess diff hunks. Instead, it runs iterative Copilot-agent loops and ranks candidate prompts by technical similarity and a separate 'realism' score, with explicit rules (no code blocks, structured markdown sections) to keep outputs human-like. The alpha-weighted scoring, model override, and prebuilt binaries suggest this is more than an experiment: it looks practical for mining realistic specs from history or auditing intent at scale.
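The output rules mentioned above (no code blocks, structured markdown sections) could be enforced with a simple filter. The sketch below is an assumption about how such a check might look, not retrospec's actual code; the function name `looks_human_like` and the `min_sections` threshold are hypothetical.

```python
import re

# A markdown code-fence marker, built from single backticks.
FENCE = "`" * 3


def looks_human_like(prompt: str, min_sections: int = 2) -> bool:
    """Hypothetical rule check for a candidate spec prompt.

    Rejects prompts that contain a fenced code block and requires at
    least `min_sections` markdown headings (lines starting with '#').
    """
    if FENCE in prompt:
        return False
    # Count heading lines such as "# Goal" or "## Constraints".
    headings = re.findall(r"^#{1,6}\s+\S", prompt, flags=re.MULTILINE)
    return len(headings) >= min_sections
```

A filter like this would only cover the formatting rules; the 'realism' score itself is described as a separate judgment and is not modeled here.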
Backend/frontend developers, engineering managers, researchers building prompt->code datasets, and anyone wanting to infer intent from commits
Given a repo + a specific commit, it iteratively searches for a plausible high-level spec prompt that could have produced that change. It runs agent loops, scores candidates for technical similarity and "realism" (does this look like a prompt a human would actually write), and outputs the best spec.
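The search described above can be sketched as a loop that keeps the best-scoring candidate. This is a minimal illustration under stated assumptions, not the project's implementation: the `Candidate` type, the `propose` callback (standing in for one agent iteration), and the linear formula `alpha * similarity + (1 - alpha) * realism` are all hypothetical readings of "alpha-weighted scoring".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    prompt: str
    similarity: float  # how closely the regenerated change matches the commit, in [0, 1]
    realism: float     # how much the prompt reads like something a human wrote, in [0, 1]


def combined_score(c: Candidate, alpha: float) -> float:
    # Assumed weighting: alpha favors similarity, (1 - alpha) favors realism.
    return alpha * c.similarity + (1.0 - alpha) * c.realism


def search_spec(
    propose: Callable[[Optional[Candidate]], Candidate],
    iterations: int,
    alpha: float = 0.5,
) -> Candidate:
    """Run `iterations` proposal rounds and return the best-scoring candidate.

    `propose` receives the current best candidate (or None on the first
    round) and returns a new, already-scored candidate.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    best: Optional[Candidate] = None
    for _ in range(iterations):
        cand = propose(best)
        if best is None or combined_score(cand, alpha) > combined_score(best, alpha):
            best = cand
    assert best is not None  # guaranteed because iterations >= 1
    return best
```

With `alpha` near 1 the loop favors prompts that reproduce the commit closely; with `alpha` near 0 it favors prompts that read naturally, even if the regenerated change drifts from the original.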
Inspiration: I saw Mitchell Hashimoto mention experimenting with agents to reproduce manual code edits, and around the same time GitHub released the Copilot SDK.
Repo: github.com/igolaizola/retrospec
This repo actually wires the specfact CLI to a tiny, reproducible codebase so you can import-from-code, generate .specfact bundles, and run enforcement presets with one-liners. The backlog-sync adapter and a deliberately buggy sidecar demo make failure modes easy to exercise, and the README lists exact smoke commands and logs to verify results. Inferred specs will always risk false positives, but the project shows practical artifacts (change_tracking, results logs) rather than theory.
Spec extraction from vibe-coded apps via reverse engineering: ambitious, but early and limited to a single integration.
Shifts agents from tool operators to commitment coordinators with evidence settlement.
AST parsing beats scraping when Slack kills their OpenAPI spec.
Third in a series (GhidrAssist, BinAssist, IDAssist) — fills the IDA gap competently.