GitHub Repository

Find what your AI agent gets wrong — before you have a rubric. Qualitative eval for PMs.

5 stars · Python

GEDD – A Systematic Evidence Driven LLM as a Judge Framework

by balasvce2026 · Jun 13, 2026 · 2 points · 0 comments

AI Analysis

●● Solid · Big Brain · Niche Gem

Qualitative eval workflow for PMs, filling the gap left by LangSmith and Arize, which target ML engineers.

Strengths
  • Axial coding and failure codebooks borrow from qualitative research methods, not just metrics.
  • Bridges domain experts and engineers with shared annotation vocabulary instead of abstract scores.
  • Exports session.json handoff with validated prompts ready for MLflow regression gates.
Weaknesses
  • AWS samples repo with 5 stars signals reference implementation, not production tool.
  • LLM-as-a-judge evaluation is crowded with LangSmith, Arize Phoenix, and custom frameworks.
Category
Target Audience

Product managers and ML engineers evaluating AI agents

Similar To

LangSmith · Arize Phoenix · MLflow

Similar Projects

AI/ML · Mid

Claude Code skills for building LLM evals

Structured eval workflow for Claude Code when LangSmith and Braintrust already exist.

Niche Gem · Ship It
paulaq
20 · 1mo ago
AI/ML · ●●● Banger

I made a small helper for checking model-graded answers

Structurally verifies LLM judge reasoning instead of paying for a second model check.

Big Brain · Solve My Problem · Dark Horse
ML0037
20 · 1d ago