HermesBench – workflow reliability evals for personal AI agents
Whole-agent evals beat model-only benchmarks, but only one baseline has been published so far.
Legal Action Boundary Eval (LABE): public proxy eval for legal AI workflows at the action boundary
Evaluates AI at the action boundary, not just understanding quality; most benchmarks stop too early.
Legal AI developers, compliance teams, AI governance stakeholders
LangChain Evals · RAGAS · Arize Phoenix
Current result:
Baseline executed 18 unjustified high-impact action points; with VerifiedX that dropped to 0.
False blocks in the current suite: 0.
Surviving-goal completion improved from 41.7% to 100%.
Same harness, same prompts, same playbooks: baseline vs VerifiedX.
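For readers who want to see how figures of this kind could be tallied, here is a minimal sketch. It is not the repo's harness: the `ActionPoint` and `Run` schemas, the `summarize` function, and the metric definitions are all assumptions made for illustration. In particular it assumes a "false block" is a justified action that was not executed, and that "surviving-goal completion" is the share of runs with a still-legitimate goal that were completed. The sample data is invented and does not reproduce the published numbers.

```python
from dataclasses import dataclass


@dataclass
class ActionPoint:
    """One high-impact action point in a run (hypothetical schema)."""
    justified: bool  # assumed label: was the action warranted?
    executed: bool   # did the agent actually carry it out?


@dataclass
class Run:
    """One playbook run (hypothetical schema)."""
    action_points: list[ActionPoint]
    goal_survives: bool   # assumed label: the goal is still legitimate to pursue
    goal_completed: bool


def summarize(runs: list[Run]) -> dict[str, float]:
    """Tally the three headline metrics under the assumed definitions."""
    points = [p for r in runs for p in r.action_points]
    # Unjustified actions that the agent executed anyway.
    unjustified_executed = sum(1 for p in points if not p.justified and p.executed)
    # Justified actions that were blocked (assumed meaning of "false block").
    false_blocks = sum(1 for p in points if p.justified and not p.executed)
    # Completion rate over runs whose goal survives.
    surviving = [r for r in runs if r.goal_survives]
    completed = sum(1 for r in surviving if r.goal_completed)
    completion_pct = 100.0 * completed / len(surviving) if surviving else 0.0
    return {
        "unjustified_executed": unjustified_executed,
        "false_blocks": false_blocks,
        "surviving_goal_completion_pct": round(completion_pct, 1),
    }


# Invented sample data, for illustration only.
sample_runs = [
    Run([ActionPoint(justified=False, executed=True)], goal_survives=True, goal_completed=False),
    Run([ActionPoint(justified=True, executed=True)], goal_survives=True, goal_completed=True),
    Run([ActionPoint(justified=True, executed=False)], goal_survives=True, goal_completed=False),
    Run([ActionPoint(justified=False, executed=False)], goal_survives=False, goal_completed=False),
]

if __name__ == "__main__":
    print(summarize(sample_runs))
```

Running the same tally over baseline and guarded runs of an identical suite is what would make the two columns comparable; the authoritative definitions are in the linked repo's methodology.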
Legal is the first public instance. The same method applies to support, healthcare RCM, procurement, and finance too.
Repo, methodology, and raw artifacts are public: https://github.com/bigkan8/legal-action-boundary-eval
External enforcement stops agents like Claude Code from escaping sandboxes.
120+ built-in test playbooks with JSON output agents can read and fix.
First benchmark for physical-world AI when MMLU only tests textbook knowledge.
OAuth + TLS for AI agents with Ed25519 identity and global kill switch before agents act.
Project workspaces with persistent memory beat chat sessions for real agent work.