GitHub Repository

LOAB: A benchmark for evaluating LLM agents on end-to-end mortgage lending operations under real regulatory constraints.

5 stars · Python

LOAB – AI agents get decisions right but skip the process [pdf]

Author: shubh-chat

by shubh-chat · Mar 9, 2026 · 1 point · 0 comments


AI Analysis

●●● Banger · Big Brain · Bold Bet

Frontier models hit 67-75% outcome accuracy but only 25-42% on process compliance.

Strengths

• First benchmark measuring process compliance, not just final decision outcomes.
• Five-dimension rubric covers tool calls, handoffs, forbidden actions, evidence.
• Mock regulatory APIs simulate real bank operations with multi-agent roles.

Weaknesses

• Currently only three origination tasks in proof-of-concept release.
• Australian mortgage focus limits immediate global applicability.

Post Description

LOAB is an open-source benchmark for evaluating whether AI agents can follow regulated lending processes, not just produce the right final answer.

The motivation is simple: in mortgage lending, regulators don't just care whether you got the right answer; they care whether you followed the right process. Skip a KYC check, pull a credit bureau report before getting privacy consent, or approve a loan without the required policy lookup, and that's a compliance failure even if the outcome was correct. Current AI benchmarks don't measure this. They evaluate what the agent decided, not how it got there.

LOAB simulates a fictional Australian lender with mock regulatory APIs, multi-agent roles mirroring real bank operations, and a five-dimension scoring rubric derived from actual lending law. A run only passes if the outcome is correct AND the process was correct.

The main finding: frontier models achieve 67-75% outcome accuracy, but only 25-42% when you also require process compliance. It's surprisingly hard to get AI to follow a prescribed sequence of steps even when it clearly "knows" the right answer.
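
The pass rule described above (correct outcome AND correct process) can be illustrated with a minimal Python sketch. This is not LOAB's actual scoring code or rubric: the tool names (`kyc_check`, `privacy_consent`, `credit_bureau_pull`, `policy_lookup`, `approve_loan`), the `RunResult` type, and the helper functions are all hypothetical, chosen only to mirror the three example violations mentioned in the post.

```python
from dataclasses import dataclass, field

# Hypothetical tool names for illustration; LOAB's real tool names may differ.
REQUIRED_STEPS = {"kyc_check", "policy_lookup"}

# (earlier, later): whenever `later` is called, `earlier` must already have run.
MUST_PRECEDE = [
    ("privacy_consent", "credit_bureau_pull"),
    ("policy_lookup", "approve_loan"),
]


@dataclass
class RunResult:
    outcome_correct: bool
    violations: list = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # A run passes only if the outcome is right AND no process rule was broken.
        return self.outcome_correct and not self.violations


def check_process(trace):
    """Return a list of process violations found in an ordered list of tool calls."""
    violations = []
    for step in sorted(REQUIRED_STEPS):
        if step not in trace:
            violations.append(f"missing required step: {step}")
    for earlier, later in MUST_PRECEDE:
        if later in trace:
            first_later = trace.index(later)
            if earlier not in trace[:first_later]:
                violations.append(f"{later} called before {earlier}")
    return violations


def score_run(trace, decision, expected_decision):
    """Combine outcome correctness with process compliance for one run."""
    return RunResult(decision == expected_decision, check_process(trace))


if __name__ == "__main__":
    # Correct decision, but the bureau report was pulled before consent.
    trace = ["kyc_check", "credit_bureau_pull", "privacy_consent",
             "policy_lookup", "approve_loan"]
    result = score_run(trace, "approve", "approve")
    print(result.outcome_correct, result.passed, result.violations)
```

In this sketch, a run with the right decision but an out-of-order bureau pull counts as outcome-correct yet still fails overall, which is the gap between the two accuracy ranges the post reports.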