GitHub Repository

LOAB: A benchmark for evaluating LLM agents on end-to-end mortgage lending operations under real regulatory constraints.

5 stars · Python

LOAB – benchmarking AI process fidelity in lending

by shubh-chat·Mar 3, 2026·1 point·0 comments

AI Analysis

●●● Banger · Big Brain · Solve My Problem · Zero to One

Scores AI agents on process fidelity, not just outcomes, so it catches skipped KYC steps that outcome-only benchmarks miss.

Strengths
  • Multi-dimensional rubric (outcome, tool use, handoffs, forbidden actions, evidence) tests realistic failure modes vs. single-metric benchmarks.
  • Real lending domain with actual regulatory constraints and hard policy limits, not toy tasks.
  • Extensible design across lending lifecycle and to other regulated industries.
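
To make the rubric idea concrete, here is a minimal sketch of scoring one agent run on the five dimensions named above (outcome, tool use, handoffs, forbidden actions, evidence). It is not LOAB's actual scoring code: `TaskSpec`, `AgentRun`, `score_run`, and every tool and field name are hypothetical, and the all-or-nothing pass rule is an assumption made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """Hypothetical task definition; field names are illustrative, not LOAB's schema."""
    expected_outcome: str
    required_tools: list[str]       # must appear in this order in the run
    required_handoffs: list[str]    # must appear in this order in the run
    forbidden_actions: set[str]     # must never be called
    required_evidence: set[str]     # must all be collected


@dataclass
class AgentRun:
    """Hypothetical trace of what the agent did on one task."""
    outcome: str
    tool_calls: list[str] = field(default_factory=list)
    handoffs: list[str] = field(default_factory=list)
    evidence: set[str] = field(default_factory=set)


def is_subsequence(needed: list[str], actual: list[str]) -> bool:
    """True if every item in `needed` occurs in `actual`, in the same order."""
    remaining = iter(actual)
    return all(step in remaining for step in needed)


def score_run(spec: TaskSpec, run: AgentRun) -> dict[str, bool]:
    """Check each rubric dimension separately, then require all of them to pass."""
    checks = {
        "outcome": run.outcome == spec.expected_outcome,
        "tool_use": is_subsequence(spec.required_tools, run.tool_calls),
        "handoffs": is_subsequence(spec.required_handoffs, run.handoffs),
        "forbidden_actions": not (spec.forbidden_actions & set(run.tool_calls)),
        "evidence": spec.required_evidence <= run.evidence,
    }
    checks["passed"] = all(checks.values())
    return checks


if __name__ == "__main__":
    spec = TaskSpec(
        expected_outcome="approved",
        required_tools=["verify_kyc", "credit_check", "approve_loan"],
        required_handoffs=["underwriter"],
        forbidden_actions={"override_policy_limit"},
        required_evidence={"kyc_report"},
    )
    # The agent reaches the right decision but never runs the KYC step.
    run = AgentRun(
        outcome="approved",
        tool_calls=["credit_check", "approve_loan"],
        handoffs=["underwriter"],
        evidence={"kyc_report"},
    )
    print(score_run(spec, run))
```

In this sketch the run with the skipped KYC call gets the right outcome but still fails, because the tool-use check is scored independently of the final decision.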
Weaknesses
  • Only three proof-of-concept tasks; unclear if results generalize beyond Australian mortgage origination.
  • Pass rates are low (Claude 0/4 on clean approval); unclear if this reflects agent limitations or benchmark miscalibration.
Category
Target Audience

Lending institutions, AI systems engineers, compliance teams evaluating agent reliability

Similar To

HELM (Stanford) · BIG-Bench · GPQA

Post Description

As the conversation around AI replacing knowledge work gets louder, I wanted to test it against something actually messy — a real lending workflow with regulations, ordered steps, and hard policy constraints. LOAB is my early attempt at that. The plan is to grow it across lending first, then other industries. Would love thoughts and feedback.

Similar Projects

AI/ML · ●● Solid

Agentic Intent Benchmark

First benchmark testing structured requirements on complex greenfield agent tasks.

Niche Gem · Big Brain
ryan4rtmx
2016d ago

Kreuzberg Comparative Benchmarks

The site weaponizes a compact set of benchmarks — throughput, RAM, cold-start, F1 score and install footprint — and even publishes raw JSON on GitHub, which makes it immediately useful for teams comparing ingestion options. Kreuzberg's Rust implementation posts jaw-dropping numbers against common tools; that's interesting, but the page leaves out crucial reproducibility details (datasets, seed runs, environment configs) you'd want before trusting the magnitude of those gaps.

Niche Gem · Wizardry
nhirschfeld
104mo ago