Amber, a capability-based runtime/compiler for agent benchmarks
Fuchsia-inspired capability model for agent benchmarks that addresses the reproducibility problems existing tools ignore.

Normalizes disparate benchmarks into a single IQ score, but relies on opaque calibration curves.
AI researchers, developers, and tech enthusiasts tracking model performance.
LMSys Chatbot Arena · Hugging Face Open LLM Leaderboard · Papers With Code
Another career dashboard when LinkedIn Salary and Levels.fyi already exist.
Teaches you to spot when benchmark scores are noise versus signal before you trust a paper.
SJF4J beats Jayway by 7x on native objects, but JSONPath is a crowded category.
Mining your own PRs as benchmarks beats generic SWE-bench tasks for agent config tuning.
Unsupervised bug benchmark that uses agents as both attackers and defenders, with a novel scoring methodology.