GitHub Repository

LOAB: A benchmark for evaluating LLM agents on end-to-end mortgage lending operations under real regulatory constraints.

5 stars · Python

LOAB – benchmarking AI process fidelity in lending

by shubh-chat·Mar 3, 2026·1 point·0 comments

AI Analysis

●●● Banger · Big Brain · Solve My Problem · Zero to One

Scores AI agents on process fidelity, not just outcomes—catches KYC skips that other benchmarks miss.

Strengths

• Multi-dimensional rubric (outcome, tool use, handoffs, forbidden actions, evidence) tests realistic failure modes that single-metric benchmarks miss.
• Real lending domain with actual regulatory constraints and hard policy limits, not toy tasks.
• Extensible design that can grow across the lending lifecycle and into other regulated industries.
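
To make the rubric idea concrete, here is a minimal sketch of how a run could be scored on the five dimensions named above, with a pass requiring every dimension to hold. This is an illustration only: the class names, tool names, and pass rule are assumptions and are not taken from the LOAB repository.

```python
from dataclasses import dataclass

# Hypothetical sketch: none of these names come from the LOAB codebase.


@dataclass
class TaskSpec:
    expected_decision: str
    required_tools: list      # tools that must appear, in this order
    required_handoffs: list   # handoffs that must appear, in this order
    forbidden_tools: set      # tools the agent must never call
    required_evidence: set    # documents the decision must cite


@dataclass
class AgentRun:
    decision: str
    tool_calls: list
    handoffs: list
    evidence: set


def is_subsequence(needed, seen):
    """True if every item in `needed` occurs in `seen`, in the same order."""
    remaining = iter(seen)
    return all(step in remaining for step in needed)


def score(spec, run):
    """Return one boolean per rubric dimension."""
    return {
        "outcome": run.decision == spec.expected_decision,
        "tool_use": is_subsequence(spec.required_tools, run.tool_calls),
        "handoffs": is_subsequence(spec.required_handoffs, run.handoffs),
        "forbidden_actions": not (spec.forbidden_tools & set(run.tool_calls)),
        "evidence": spec.required_evidence <= run.evidence,
    }


def passed(scores):
    """Assumed pass rule: a run passes only if every dimension passes."""
    return all(scores.values())


# Illustrative task: the agent reaches the right decision but skips KYC.
spec = TaskSpec(
    expected_decision="approve",
    required_tools=["verify_kyc", "check_credit", "assess_serviceability"],
    required_handoffs=["underwriter"],
    forbidden_tools={"override_policy_limit"},
    required_evidence={"payslip", "credit_report"},
)
run = AgentRun(
    decision="approve",
    tool_calls=["check_credit", "assess_serviceability"],
    handoffs=["underwriter"],
    evidence={"payslip", "credit_report"},
)
scores = score(spec, run)
```

In this example the outcome dimension passes while tool use fails, so the run fails overall. An outcome-only benchmark would have counted it as a success.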

Weaknesses

• Only three proof-of-concept tasks; unclear if results generalize beyond Australian mortgage origination.
• Pass rates are low (Claude 0/4 on clean approval); unclear if this reflects agent limitations or benchmark miscalibration.

Post Description

As the conversation around AI replacing knowledge work gets louder, I wanted to test it against something actually messy — a real lending workflow with regulations, ordered steps, and hard policy constraints. LOAB is my early attempt at that. The plan is to grow it across lending first, then other industries. Would love thoughts and feedback.

Similar Projects

AI/ML · ●●● Banger

LOAB – AI agents get decisions right but skip the process [pdf]

Frontier models hit 67-75% outcome accuracy but only 25-42% on process compliance.

Big Brain · Bold Bet

shubh-chat

103mo ago

AI/ML · ●● Solid

Agentic Intent Benchmark

First benchmark testing structured requirements on complex greenfield agent tasks.

Niche Gem · Big Brain

ryan4rtmx

2016d ago

Security · ●●● Banger

Scam – 1Password's open-source benchmark for AI security awareness

Tests AI agents against realistic phishing in live scenarios, not static email-classification tasks.

Big Brain · Zero to One

terracatta

204mo ago

Security · ●● Solid

I built AgentSafety, an open benchmark for coding-agent safety

Concrete safety benchmark for code agents when baseline evaluation barely exists.

Big Brain · Niche Gem · Solve My Problem

serkanaltuntas

103mo ago

Developer Tools · ● Mid

Kreuzberg Comparative Benchmarks

The site weaponizes a compact set of benchmarks — throughput, RAM, cold-start, F1 score and install footprint — and even publishes raw JSON on GitHub, which makes it immediately useful for teams comparing ingestion options. Kreuzberg's Rust implementation posts jaw-dropping numbers against common tools; that's interesting, but the page leaves out crucial reproducibility details (datasets, seed runs, environment configs) you'd want before trusting the magnitude of those gaps.

Niche Gem · Wizardry

nhirschfeld

104mo ago

Developer Tools · ● Mid

JavaScript Performance Benchmarking

jsPerf has owned JavaScript benchmarking for 15 years — this is a cleaner clone without differentiation.

Cozy

emurlin

212mo ago