GitHub Repository

LOAB: A benchmark for evaluating LLM agents on end-to-end mortgage lending operations under real regulatory constraints.

5 stars · Python

LOAB – AI agents get decisions right but skip the process [pdf]

by shubh-chat·Mar 9, 2026·1 point·0 comments

AI Analysis

●●● Banger · Big Brain · Bold Bet

Frontier models hit 67-75% outcome accuracy but only 25-42% once process compliance is also required.

Strengths
  • First benchmark measuring process compliance, not just final decision outcomes.
  • Five-dimension rubric covers tool calls, handoffs, forbidden actions, evidence.
  • Mock regulatory APIs simulate real bank operations with multi-agent roles.
Weaknesses
  • Currently only three origination tasks in proof-of-concept release.
  • Australian mortgage focus limits immediate global applicability.
Category
Target Audience

AI developers building agents for regulated industries, compliance teams

Similar To

GAIA · AgentBench · SWE-bench

Post Description

LOAB is an open-source benchmark for evaluating whether AI agents can follow regulated lending processes, not just produce the right final answer. The motivation is simple: in mortgage lending, regulators don't care if you got the right answer. They care whether you followed the right process. Skip a KYC check, pull a credit bureau report before getting privacy consent, or approve a loan without the required policy lookup, and that's a compliance failure even if the outcome was correct.

Current AI benchmarks don't measure this. They evaluate what the agent decided, not how it got there. LOAB simulates a fictional Australian lender with mock regulatory APIs, multi-agent roles mirroring real bank operations, and a five-dimension scoring rubric derived from actual lending law. A run only passes if the outcome is correct AND the process was correct.

The main finding: frontier models achieve 67-75% outcome accuracy but only 25-42% when you also require process compliance. It's surprisingly hard to get AI to follow a prescribed sequence of steps even when it clearly "knows" the right answer.
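To make the "outcome AND process" pass rule concrete, here is a minimal sketch of how a run could be scored against ordering constraints like the ones described above. It is not LOAB's actual rubric or API: the tool names (`get_privacy_consent`, `pull_credit_report`, `run_kyc_check`, `lookup_policy`, `approve_loan`), the `REQUIRED_ORDER` table, and the `score_run` helper are all hypothetical, and the check covers only step ordering, not the full five dimensions.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical ordering rules (not taken from LOAB): each pair (before, after)
# means `before` must already have been called when `after` is called.
REQUIRED_ORDER = [
    ("get_privacy_consent", "pull_credit_report"),
    ("run_kyc_check", "approve_loan"),
    ("lookup_policy", "approve_loan"),
]


@dataclass
class RunResult:
    outcome_correct: bool
    violations: List[str]

    @property
    def passed(self) -> bool:
        # A run passes only if the decision is right AND no process rule was broken.
        return self.outcome_correct and not self.violations


def find_violations(trace: List[str]) -> List[str]:
    """Return one message per ordering rule that the tool-call trace breaks."""
    violations = []
    for before, after in REQUIRED_ORDER:
        if after not in trace:
            continue  # the rule only applies if the gated step actually ran
        first_after = trace.index(after)
        if before not in trace[:first_after]:
            violations.append(f"{after} called without prior {before}")
    return violations


def score_run(trace: List[str], decision: str, expected: str) -> RunResult:
    """Combine outcome correctness with the process check."""
    return RunResult(decision == expected, find_violations(trace))


# The agent reaches the right decision but pulls the credit report before consent.
trace = [
    "run_kyc_check",
    "pull_credit_report",
    "get_privacy_consent",
    "lookup_policy",
    "approve_loan",
]
result = score_run(trace, decision="approve", expected="approve")
print(result.outcome_correct, result.passed, result.violations)
```

In this sketch the run counts as outcome-correct but still fails overall, which is the gap the headline numbers describe: outcome accuracy alone would score it as a success.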

Similar Projects

AI/ML · ●●● Banger

LOAB – benchmarking AI process fidelity in lending

Scores AI agents on process fidelity, not just outcomes, and catches KYC skips that other benchmarks miss.

Big Brain · Solve My Problem · Zero to One
shubh-chat
103mo ago