GitHub Repository

A GUI-first evaluation workbench for local LLMs running on Ollama. Build personal test suites, run sequential evaluations across installed models, visualize results through dashboards, and make keep-or-delete decisions. Think "Postman for local LLM evaluation."

20 stars · TypeScript

ModelSweep - Open-Source Benchmarking for Local LLMs

by leonickson · Mar 17, 2026 · 2 points · 0 comments

AI Analysis

●● Solid · Ship It · Niche Gem · Slick

Postman for local LLMs with LLM-as-Judge and Elo ratings built in.
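
The listing does not show how ModelSweep computes its ratings, so the following is only a minimal sketch of the standard Elo update for one head-to-head comparison between two models. The K-factor of 32, the 400-point scale, and the names `expectedScore` and `updateElo` are assumptions for illustration, not taken from the project.

```typescript
// Standard Elo update for one pairwise comparison between two models.
// K_FACTOR and the 400-point scale are conventional defaults, assumed here.
const K_FACTOR = 32;

interface EloResult {
  a: number; // new rating for model A
  b: number; // new rating for model B
}

// Expected score of a model rated `ra` against a model rated `rb`.
function expectedScore(ra: number, rb: number): number {
  return 1 / (1 + Math.pow(10, (rb - ra) / 400));
}

// scoreA: 1 if the judge prefers A's answer, 0 if it prefers B's, 0.5 for a tie.
function updateElo(ra: number, rb: number, scoreA: number): EloResult {
  const ea = expectedScore(ra, rb);
  const delta = K_FACTOR * (scoreA - ea);
  // The update is zero-sum: what A gains, B loses.
  return { a: ra + delta, b: rb - delta };
}
```

With two models both starting at 1500, a win for A moves the ratings to 1516 and 1484 under these assumed constants.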

Strengths

• Sequential model testing with automatic VRAM preload/unload management
• Four evaluation modes, including adversarial red-team testing scenarios
• Fully local execution with zero data leaving the machine

Weaknesses

• Built in two days, so bugs and rough edges are likely still present
• Ollama-only support limits compatibility with other model runners

Category

Target Audience

Developers testing local LLMs, Ollama users, AI researchers

Similar To

LangSmith · MLflow · LM Evaluation Harness

Similar Projects

AI/ML · ●●● Banger

Ranking 19 LLMs on Flutter code by compile pass and hidden-test pass@1

Hidden executable tests separate working code from static analysis fluff better than HumanEval.

Big Brain · Dark Horse

GeorgiKadrev

203d ago

AI/ML · ●●● Banger

Auto LLM Ranker – Describe a task in English and get ranked models

Task-specific LLM benchmarking beats generic leaderboards that ignore your actual workload.

Big Brain · Dark Horse · Zero to One

gauravvij137

304mo ago

Developer Tools · ●● Solid

Aludel – LLM eval workbench for Phoenix apps

Phoenix LiveView embedding beats switching to LangSmith for Elixir teams.

Niche Gem · Ship It

wood-archer

203mo ago

Developer Tools · ● Mid

FC-Eval – CLI to Benchmark Local or Cloud LLMs on Function Calling

AST-based validation for function calling tests, but BFCL already covers this ground.

Ship It · Niche Gem

gauravvij137

304mo ago

Developer Tools · ●● Solid

Openleetcode – local LeetCode runner with open test suites

Open test suites in the repo when LeetCode keeps theirs closed.

Niche Gem

therepanic

4018d ago

AI/ML · ●●● Banger

LLM Sycophancy Benchmark: Opposite-Narrator Contradictions

Opposite-narrator test catches models agreeing with both sides of the same dispute.

Big Brain · Dark Horse

zone411

304mo ago