GitHub Repository
4 stars · Python

Auto LLM Ranker – Describe a task in English and get ranked models

by gauravvij137·Mar 9, 2026·3 points·0 comments

AI Analysis

●●● Banger · Big Brain · Dark Horse · Zero to One

Task-specific LLM benchmarking beats generic leaderboards that ignore your actual workload.

Strengths
  • Auto-generated test suites match your specific task instead of relying on static benchmarks
  • Parallel benchmarking across OpenRouter models captures real latency and accuracy tradeoffs
  • Judge LLM scoring across 5 dimensions exposes accuracy-latency tradeoffs that leaderboards hide
Weaknesses
  • Judge LLM introduces positional and familiarity bias into the scores
  • Requires OpenRouter API credits for benchmarking, costs scale with model count
Target Audience

AI engineers, developers selecting LLMs for production

Similar To

LangSmith · Braintrust · Arize Phoenix

Post Description

I got tired of picking LLMs based on vibes and leaderboards that don't reflect real workloads, so I built this.

You describe a task in plain English. The tool generates a test suite for that specific task, discovers candidate models via OpenRouter, benchmarks them in parallel, and uses a Judge LLM to score every response across 5 dimensions: accuracy, hallucination, grounding, tool-calling, and clarity.
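The parallel benchmarking step described above could be sketched roughly as follows. This is not the repository's code: `call_model` is a hypothetical stand-in for an OpenRouter request, the model names are placeholders, and the thread pool is just one reasonable way to run I/O-bound calls concurrently while timing each one.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a real OpenRouter chat request."""
    return f"[{model}] answer to: {prompt}"


def timed_call(model: str, prompt: str) -> dict:
    """Run one (model, prompt) pair and record wall-clock latency."""
    start = time.perf_counter()
    response = call_model(model, prompt)
    latency_s = time.perf_counter() - start
    return {
        "model": model,
        "prompt": prompt,
        "response": response,
        "latency_s": latency_s,
    }


def benchmark(models: list[str], prompts: list[str], max_workers: int = 8) -> list[dict]:
    """Fan every (model, prompt) pair out to a thread pool."""
    jobs = [(m, p) for m in models for p in prompts]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `jobs` in its results.
        return list(pool.map(lambda job: timed_call(*job), jobs))


# Placeholder model names and test prompts, for illustration only.
results = benchmark(["model-a", "model-b"], ["Write a SQL query", "Fix this bug"])
```

Each record keeps the raw response next to its latency, so a later judging pass can attach per-dimension scores without re-running the models.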

Output is a ranked top 3 with average latency per model and a task-specific system prompt optimized for the winner.
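One way the ranking could be computed from judged responses is sketched below. The aggregation rule is an assumption, not something the post specifies: every dimension is scored so that higher is better, the five dimensions are weighted equally, and ties are broken by lower average latency. The record layout and model names are hypothetical.

```python
from statistics import mean

# The five dimensions named in the post; the key spellings are assumptions.
DIMENSIONS = ("accuracy", "hallucination", "grounding", "tool_calling", "clarity")


def rank_models(records: list[dict], top_n: int = 3) -> list[dict]:
    """Average judge scores and latency per model, then return the top N."""
    by_model: dict[str, list[dict]] = {}
    for rec in records:
        by_model.setdefault(rec["model"], []).append(rec)

    summary = []
    for model, recs in by_model.items():
        # Assumed rule: unweighted mean over dimensions, then over test cases.
        per_response = [mean(r["scores"][d] for d in DIMENSIONS) for r in recs]
        summary.append({
            "model": model,
            "score": mean(per_response),
            "avg_latency_s": mean(r["latency_s"] for r in recs),
        })

    # Highest score first; lower latency breaks ties.
    summary.sort(key=lambda row: (-row["score"], row["avg_latency_s"]))
    return summary[:top_n]


# Made-up scores purely to exercise the function.
records = [
    {"model": "model-a", "scores": {d: 8 for d in DIMENSIONS}, "latency_s": 2.0},
    {"model": "model-a", "scores": {d: 6 for d in DIMENSIONS}, "latency_s": 4.0},
    {"model": "model-b", "scores": {d: 9 for d in DIMENSIONS}, "latency_s": 5.0},
    {"model": "model-c", "scores": {d: 5 for d in DIMENSIONS}, "latency_s": 1.0},
    {"model": "model-d", "scores": {d: 7 for d in DIMENSIONS}, "latency_s": 1.5},
]
top = rank_models(records)
```

In this made-up data the highest-scoring model is also the slowest, which is the kind of tradeoff a per-model latency column makes visible.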

A few things I learned while building it:

- Score and latency rarely correlate. The best model for accuracy on coding tasks was almost never the fastest. This tradeoff is completely task-dependent and impossible to see from benchmarks that don't reflect your workload.
- The Judge LLM approach is surprisingly consistent but introduces positional and familiarity bias. Using one model to score others isn't perfect, but it's far more reproducible than manual eval. Open to ideas on how to reduce judge bias without blowing up the cost.
- Model discovery matters more than I expected. The top performers on generic benchmarks often weren't the top performers on narrow tasks.
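On the positional-bias question, one common mitigation is to show the judge the same responses in several orders and average the scores. The sketch below assumes the judge sees all candidate responses in a single prompt, which the post does not confirm. `biased_judge` is a toy stand-in that deliberately favours the first slot so the effect is visible, and it does nothing about familiarity bias.

```python
from statistics import mean


def biased_judge(ordered_responses: list[str]) -> list[float]:
    """Toy judge (hypothetical): quality is the text length, plus an
    unfair +1 bonus for whichever response is shown first."""
    return [
        len(text) + (1.0 if position == 0 else 0.0)
        for position, text in enumerate(ordered_responses)
    ]


def judge_with_rotation(responses: dict[str, str], judge) -> dict[str, float]:
    """Score under every cyclic rotation so each model occupies each
    position exactly once, then average per model."""
    models = list(responses)
    collected = {m: [] for m in models}
    for shift in range(len(models)):
        order = models[shift:] + models[:shift]
        scores = judge([responses[m] for m in order])
        for model, score in zip(order, scores):
            collected[model].append(score)
    return {m: mean(vals) for m, vals in collected.items()}


responses = {"a": "xx", "b": "xxxx", "c": "xxx"}
single_pass = dict(zip(responses, biased_judge(list(responses.values()))))
rotated = judge_with_rotation(responses, biased_judge)
```

In the single pass, `a` ties with `c` only because it was shown first; after rotation the bonus is spread evenly and `c` ranks above `a`. Full rotation multiplies judge calls by the number of candidates, so a cheaper variant would score only the original and reversed orders, at the price of cancelling the bias less completely.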

Stack: Python, OpenRouter for model access, MIT licensed.

https://github.com/gauravvij/llm-evaluator

Happy to answer questions on the design decisions.
