Auto LLM Ranker – Describe a task in English and get ranked models

by gauravvij137 · Mar 9, 2026 · 3 points · 0 comments

AI Analysis

Task-specific LLM benchmarking beats generic leaderboards that ignore your actual workload.

Strengths

• Auto-generated test suites match your specific task instead of relying on static benchmarks
• Parallel benchmarking across OpenRouter models captures real latency and accuracy tradeoffs
• Judge LLM scoring across 5 dimensions exposes accuracy-versus-latency tradeoffs that leaderboards hide

Weaknesses

• Judge LLM scoring is subject to positional and familiarity bias
• Requires OpenRouter API credits for benchmarking; costs scale with model count

Post Description

I got tired of picking LLMs based on vibes and leaderboards that don't reflect real workloads, so I built this.

You describe a task in plain English. The tool generates a test suite for that specific task, discovers candidate models via OpenRouter, benchmarks them in parallel, and uses a Judge LLM to score every response across 5 dimensions: accuracy, hallucination, grounding, tool-calling, and clarity.
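The parallel benchmarking step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the tool's actual code: `call_model` is a hypothetical stub standing in for a real OpenRouter request, and the model names, prompts, and `RunResult` type are placeholders invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class RunResult:
    model: str
    prompt: str
    response: str
    latency_s: float


def call_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a real OpenRouter request."""
    time.sleep(0.01)  # simulate network and generation time
    return f"[{model}] answer to: {prompt}"


def timed_call(model: str, prompt: str) -> RunResult:
    """Call one model on one prompt and record wall-clock latency."""
    start = time.perf_counter()
    response = call_model(model, prompt)
    return RunResult(model, prompt, response, time.perf_counter() - start)


def benchmark(models: list[str], prompts: list[str], max_workers: int = 8) -> list[RunResult]:
    """Run every (model, prompt) pair concurrently and collect timed results."""
    jobs = [(m, p) for m in models for p in prompts]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda job: timed_call(*job), jobs))


# Placeholder model names and prompts, for illustration only.
results = benchmark(
    ["model-a", "model-b"],
    ["Summarize this ticket.", "Extract the invoice total."],
)
```

Threads are a reasonable fit here because the work is I/O-bound. Latency is measured per call, so it reflects each request rather than the total wall-clock time of the batch.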

Output is a ranked top 3 with average latency per model and a task-specific system prompt optimized for the winner.
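One way the ranking could be computed is sketched below. It assumes each judged run is a dict with a `scores` mapping over the five dimensions, that every dimension is scored so that higher is better (including hallucination), and that dimensions are weighted equally. The field names and the tie-break rule are illustrative choices, not taken from the project.

```python
from statistics import mean

DIMENSIONS = ("accuracy", "hallucination", "grounding", "tool_calling", "clarity")


def rank_models(runs: list[dict], top_n: int = 3) -> list[dict]:
    """Average judge scores and latency per model, then sort by score."""
    by_model: dict[str, list[dict]] = {}
    for run in runs:
        by_model.setdefault(run["model"], []).append(run)

    summary = []
    for model, model_runs in by_model.items():
        # Assumption: equal weight for all five dimensions.
        per_run = [mean(r["scores"][d] for d in DIMENSIONS) for r in model_runs]
        summary.append({
            "model": model,
            "score": mean(per_run),
            "avg_latency_s": mean(r["latency_s"] for r in model_runs),
        })

    # Higher score wins; lower average latency breaks ties.
    summary.sort(key=lambda s: (-s["score"], s["avg_latency_s"]))
    return summary[:top_n]
```

Latency is reported alongside the score rather than folded into it, so the reader can make the accuracy-versus-speed call for their own workload.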

A few things I learned while building it:

- Score and latency rarely correlate. The best model for accuracy on coding tasks was almost never the fastest. This tradeoff is completely task-dependent and impossible to see from benchmarks that don't reflect your workload.
- The Judge LLM approach is surprisingly consistent but introduces positional and familiarity bias. Using one model to score others isn't perfect, but it's far more reproducible than manual eval. Open to ideas on how to reduce judge bias without blowing up the cost.
- Model discovery matters more than I expected. The top performers on generic benchmarks often weren't the top performers on narrow tasks.
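On the positional part of the judge bias, one common mitigation is to present the candidates in every rotation of their order and average each model's scores. The sketch below assumes the judge sees several responses in a single call, which may not match how this tool prompts its judge. `biased_stub_judge` is a hypothetical judge invented to make the effect visible, not a real API.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable

# A judge takes responses in presentation order and returns one score per slot.
Judge = Callable[[list[str]], list[float]]


def rotation_averaged_scores(responses: dict[str, str], judge: Judge) -> dict[str, float]:
    """Score every rotation of the candidate order and average per model."""
    models = list(responses)
    collected: dict[str, list[float]] = defaultdict(list)
    for shift in range(len(models)):
        order = models[shift:] + models[:shift]  # each model visits each slot once
        scores = judge([responses[m] for m in order])
        for model, score in zip(order, scores):
            collected[model].append(score)
    return {model: mean(vals) for model, vals in collected.items()}


def biased_stub_judge(ordered: list[str]) -> list[float]:
    """Hypothetical judge: scores by response length, plus a bonus for slot 0."""
    return [len(text) + (2.0 if slot == 0 else 0.0) for slot, text in enumerate(ordered)]
```

With N candidates this costs N judge calls instead of one, so it trades cost for fairness, and it does nothing for familiarity bias. Sampling a few random orders instead of all rotations is a cheaper, noisier variant.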

Stack: Python, OpenRouter for model access, MIT licensed.

https://github.com/gauravvij/llm-evaluator

Happy to answer questions on the design decisions.