Valohai LLM – Track and compare LLM evaluation results in one dashboard

by radicain·Feb 19, 2026·3 points·0 comments

AI Analysis

Solid · Niche Gem · Solve My Problem
The Take

Streams eval results from a small Python client into a shared dashboard, supports parameter sweeps, and compares up to six configurations with radar/bar charts and scorecards: the sort of tooling that keeps results from getting lost in notebooks. A useful, pragmatic product for teams that evaluate models repeatedly, but it competes with general observability and experiment trackers (W&B, Neptune) and will need strong integrations and metric flexibility to stand out.

Target Audience

ML engineers, MLOps teams, data scientists and evaluation-focused researchers

Post Description

We built Valohai LLM for tracking and comparing LLM evaluation results. Whether your evals live in notebooks and spreadsheets, or you're using an observability tool that wasn't built for comparison, this gives you a purpose-built eval comparison dashboard.

Run evals with a Python library (pip install valohai-llm), watch results stream in, and compare up to 6 configurations side by side. Group by any dimension (model, category, difficulty) to see where each model excels.
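To make the "group by any dimension" idea concrete, here is a minimal sketch in plain Python. It does not use the valohai-llm API, which this post does not document; the `group_mean` helper, the record fields, and the sample scores are all hypothetical and only illustrate the kind of per-model breakdown the dashboard is described as showing.

```python
from collections import defaultdict
from statistics import mean


def group_mean(results, dimension, metric="score"):
    """Average `metric` per (model, dimension value) pair.

    Hypothetical helper for illustration only; not part of valohai-llm.
    """
    buckets = defaultdict(list)
    for row in results:
        buckets[(row["model"], row[dimension])].append(row[metric])
    return {key: mean(values) for key, values in buckets.items()}


# Made-up eval records; field names are assumptions.
results = [
    {"model": "model-a", "category": "math", "difficulty": "hard", "score": 0.5},
    {"model": "model-a", "category": "math", "difficulty": "easy", "score": 1.0},
    {"model": "model-a", "category": "code", "difficulty": "hard", "score": 0.25},
    {"model": "model-b", "category": "math", "difficulty": "hard", "score": 1.0},
    {"model": "model-b", "category": "code", "difficulty": "hard", "score": 0.75},
]

by_category = group_mean(results, "category")
for (model, category), avg in sorted(by_category.items()):
    print(f"{model:8s} {category:5s} {avg:.2f}")
```

Swapping `"category"` for `"difficulty"` regroups the same records along a different axis, which is the comparison the post describes.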

It doesn't do tracing or production observability; for now it covers only eval tracking and comparison. What's cool is that you can define the parameters you want to test and run a sweep across all of them.
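As a rough illustration of what a sweep across parameters means, the sketch below expands a parameter grid into every combination using only the standard library. It is not the valohai-llm API; `expand_sweep`, the parameter names, and the values are assumptions chosen for the example.

```python
from itertools import product


def expand_sweep(grid):
    """Return one config dict per combination of parameter values.

    Hypothetical helper for illustration only; not part of valohai-llm.
    """
    names = list(grid)
    return [
        dict(zip(names, combo))
        for combo in product(*(grid[name] for name in names))
    ]


# Made-up parameters to sweep over.
grid = {
    "model": ["model-a", "model-b"],
    "temperature": [0.0, 0.7],
    "prompt_style": ["plain", "cot"],
}

configs = expand_sweep(grid)
for config in configs:
    print(config)  # each config would be handed to an eval run
```

Note that the number of runs is the product of the option counts (here 2 × 2 × 2), so grids grow quickly as parameters are added.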

Feedback welcome, especially from anyone who compares and evaluates models regularly!

Similar Projects

OpenCode Benchmark Dashboard

Benchmarks OpenCode models locally, but lacks preloaded datasets and only works with configured OpenAI-compatible APIs.

Niche Gem · grigio · 103mo ago