GitHub Repository

A test runner for agentskills.io-style AI agent skills

584 stars · TypeScript

Agent-skills-eval – Test whether Agent Skills improve outputs

by darkrishabh·May 7, 2026·79 points·37 comments

AI Analysis

●● Solid · Solve My Problem · Ship It

Lightweight A/B testing for SKILL.md files when LangSmith feels too heavy.

Strengths
  • Local A/B tests with judge models remove the need for heavy SaaS platforms.
  • Generates static HTML reports with side-by-side output comparisons for easy debugging.
  • Integrates directly into CI workflows to prevent skill regressions before deployment.
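The A/B idea in the strengths above can be sketched roughly as follows. This is a minimal illustration only, not the agent-skills-eval API: the `EvalCase`, `Judge`, `compare`, and `lengthJudge` names are hypothetical, and the stub judge simply prefers the longer answer where a real setup would call an evaluator model.

```typescript
// Hypothetical types for illustration; not taken from agent-skills-eval.
type Verdict = "a" | "b" | "tie";
type Judge = (prompt: string, a: string, b: string) => Verdict;

interface EvalCase {
  prompt: string;
  withSkill: string;    // output produced with the SKILL.md loaded
  withoutSkill: string; // baseline output produced without the skill
}

interface Summary {
  skillWins: number;
  baselineWins: number;
  ties: number;
  winRate: number; // skill wins divided by decided (non-tie) cases
}

// Ask the judge to compare each pair and tally the verdicts.
function compare(cases: EvalCase[], judge: Judge): Summary {
  let skillWins = 0;
  let baselineWins = 0;
  let ties = 0;
  for (const c of cases) {
    const verdict = judge(c.prompt, c.withSkill, c.withoutSkill);
    if (verdict === "a") skillWins++;
    else if (verdict === "b") baselineWins++;
    else ties++;
  }
  const decided = skillWins + baselineWins;
  return {
    skillWins,
    baselineWins,
    ties,
    winRate: decided === 0 ? 0 : skillWins / decided,
  };
}

// Stub judge (assumption): prefers the longer answer, ties on equal length.
const lengthJudge: Judge = (_prompt, a, b) =>
  a.length === b.length ? "tie" : a.length > b.length ? "a" : "b";

const summary = compare(
  [
    { prompt: "Summarize the changelog", withSkill: "A detailed summary", withoutSkill: "Short" },
    { prompt: "Name the file", withSkill: "a.ts", withoutSkill: "b.ts" },
  ],
  lengthJudge,
);

// A CI job could fail the build when the skill stops winning decided cases.
const passed = summary.winRate >= 0.5;
console.log(summary, passed ? "PASS" : "FAIL");
```

Because judge verdicts can vary by evaluator model, a gate like `winRate >= 0.5` is best treated as a coarse regression check, not a precise quality score.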
Weaknesses
  • Tied to the emerging SKILL.md standard, which may not gain widespread adoption.
  • Judge-model grading can be inconsistent, varying with the evaluator model chosen.
Target Audience

AI agent developers and prompt engineers

Similar To

LangSmith · Arize Phoenix · PromptLayer

Similar Projects

AI/ML · Mid

Claude Code skills for building LLM evals

Structured eval workflow for Claude Code when LangSmith and Braintrust already exist.

Niche Gem · Ship It
paulaq
20 · 1mo ago