LOAB – AI agents get decisions right but skip the process [pdf]
Frontier models hit 67-75% outcome accuracy but only 25-42% on process compliance.
LOAB: A benchmark for evaluating LLM agents on end-to-end mortgage lending operations under real regulatory constraints.
Scores AI agents on process fidelity, not just outcomes—catches KYC skips that other benchmarks miss.
Lending institutions, AI systems engineers, compliance teams evaluating agent reliability
HELM (Stanford) · BIG-Bench · GPQA
Frontier models hit 67-75% outcome accuracy but only 25-42% on process compliance.
First benchmark testing structured requirements on complex greenfield agent tasks.
Tests AI agents against realistic phishing in live scenarios, not static email-classification tasks.
Concrete safety benchmark for code agents when baseline evaluation barely exists.
The site weaponizes a compact set of benchmarks — throughput, RAM, cold-start, F1 score and install footprint — and even publishes raw JSON on GitHub, which makes it immediately useful for teams comparing ingestion options. Kreuzberg's Rust implementation posts jaw-dropping numbers against common tools; that's interesting, but the page leaves out crucial reproducibility details (datasets, seed runs, environment configs) you'd want before trusting the magnitude of those gaps.
jsPerf has owned JavaScript benchmarking for 15 years — this is a cleaner clone without differentiation.