Valohai LLM – Track and compare LLM evaluation results in one dashboard

by radicain·Feb 19, 2026·3 points·0 comments

AI Analysis

Solid · Niche Gem · Solve My Problem

The Take

Streams eval results from a small Python client into a shared dashboard, supports parameter sweeps, and compares up to six configurations with radar/bar charts and scorecards: exactly the sort of tooling that keeps results from getting lost in notebooks. A useful, pragmatic product for teams that evaluate models repeatedly, but it competes with general observability and experiment trackers (W&B, Neptune) and will need strong integrations and metric flexibility to stand out.

Post Description

We built Valohai LLM for tracking and comparing LLM evaluation results. Whether your evals live in notebooks and spreadsheets, or you're using an observability tool that wasn't built for comparison, this gives you a purpose-built eval comparison dashboard.

Run evals with a Python library (pip install valohai-llm), results stream in, and you can compare up to 6 configurations side by side. Group by any dimension (model, category, difficulty) to see where each model excels.
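
To make "group by any dimension" concrete, here is a minimal sketch in plain Python. It does not use the valohai-llm API (the post doesn't show it); the record fields, model names, and the `group_mean` helper are all hypothetical, chosen only to illustrate averaging a score per model within each value of a dimension.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical eval records. Field names and values are illustrative only,
# not the valohai-llm schema.
results = [
    {"model": "model-a", "category": "math", "difficulty": "easy", "score": 1.0},
    {"model": "model-a", "category": "math", "difficulty": "hard", "score": 0.0},
    {"model": "model-a", "category": "code", "difficulty": "easy", "score": 1.0},
    {"model": "model-b", "category": "math", "difficulty": "easy", "score": 1.0},
    {"model": "model-b", "category": "math", "difficulty": "hard", "score": 1.0},
    {"model": "model-b", "category": "code", "difficulty": "easy", "score": 0.0},
]


def group_mean(records, dimension, metric="score"):
    """Average `metric` for each (model, dimension value) pair."""
    buckets = defaultdict(list)
    for record in records:
        buckets[(record["model"], record[dimension])].append(record[metric])
    return {key: mean(values) for key, values in buckets.items()}


by_category = group_mean(results, "category")
by_difficulty = group_mean(results, "difficulty")

# Print one line per (model, category) pair to see where each model is stronger.
for (model, category), avg in sorted(by_category.items()):
    print(f"{model:8s} {category:5s} {avg:.2f}")
```

Swapping `"category"` for `"difficulty"` regroups the same records without rerunning anything, which is the kind of slicing a comparison dashboard automates.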

It doesn't do tracing or production observability; for now it's just eval tracking and comparison. What's cool is that you can define the parameters you want to test and run a sweep across all of them.
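
For readers unfamiliar with sweeps, the idea is roughly a Cartesian product over the parameter values you define. The sketch below shows that in plain Python; it is not how valohai-llm implements or exposes sweeps, and the grid keys, values, and the `sweep` helper are assumptions made up for illustration.

```python
from itertools import product

# Hypothetical parameter grid. Keys and values are illustrative only.
grid = {
    "model": ["model-a", "model-b"],
    "temperature": [0.0, 0.7],
    "prompt_style": ["zero-shot", "few-shot"],
}


def sweep(grid):
    """Yield one config dict per combination of parameter values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(sweep(grid))

# Each config would be passed to your own eval function; here we only list them.
for config in configs:
    print(config)
```

With two values for each of three parameters this yields eight configurations, so a sweep can quickly exceed the six that can be compared side by side.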

Feedback welcome, especially from anyone comparing models and evaluating regularly!