GitHub Repository

Benchmark any LLM against your data. Pick the best model, then make it better.

4 stars · Python

Verdict – model evals on your own data, not someone else's benchmark

by agunapal · May 7, 2026 · 2 points · 0 comments

AI Analysis

●● Solid · Solve My Problem · Slick

Run GPT-5 and Llama against your own data to pick the winner.

Strengths
  • Pluggable metrics include both ROUGE scores and LLM judges.
  • Unified CLI interface supports OpenAI, Anthropic, and local Ollama.
  • Focuses on custom data rather than generic public benchmarks.
Weaknesses
  • Crowded category with many existing eval frameworks like RAGAS.
  • Another Python wrapper around standard provider APIs.
Category
Target Audience

ML engineers and prompt engineers

Similar To

RAGAS · LangSmith · DeepEval

Similar Projects