Yardstiq – Compare LLM outputs side-by-side in your terminal

Author: stanleycyang

by stanleycyang · Mar 3, 2026 · 2 points · 0 comments

AI Analysis

●● Solid · Solve My Problem · Slick · Crowd Pleaser

One-command model comparison with real-time streaming and performance metrics beats tab-switching.

Strengths

• Eliminates genuine friction (copy-paste between tabs); querying models in parallel and streaming them side-by-side is faster than sequential API calls.
• Unified auth (Vercel AI Gateway or individual keys), local Ollama support, and multiple export formats (JSON, Markdown, HTML).
• AI judge mode and YAML benchmark suites enable repeatable, scored evaluation across model cohorts.

Weaknesses

• No validation shown for judge scoring accuracy or reproducibility; 'AI evaluates responses' is trendy but unproven.
• Solves a developer-experience problem, not a technical one; CyberChef, Anthropic Workbench, and OpenAI Playground offer similar workflows.

Post Description

Hey HN,

I built yardstiq because I got tired of the copy-paste workflow for comparing LLM responses when developing apps. Every time I wanted to see how Claude vs GPT vs Gemini handled the same prompt, I'd open three tabs, paste the same thing, and try to eyeball the differences. It's 2026 and we have 40+ models worth considering — that doesn't scale.

yardstiq is a CLI tool that sends one prompt to multiple models simultaneously and streams the responses side-by-side in your terminal. It also tracks performance metrics (time to first token, tokens/sec, cost) and optionally runs an AI judge to score the outputs.

```
npx yardstiq "Explain quicksort in 3 sentences" -m claude-sonnet -m gpt-4o
```
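For readers curious what "one prompt, many models, in parallel" involves, here is a minimal TypeScript sketch of the fan-out and timing idea. It is not yardstiq's actual code: `fakeStream`, `runModel`, `compareModels`, and the `RunResult` shape are hypothetical names, the stream is a stand-in for a real provider API, and each chunk is counted as one token for simplicity.

```typescript
// Hypothetical stand-in for a provider's streaming API: yields one token at a
// time after a fixed delay. A real client would wrap an HTTP streaming response.
async function* fakeStream(tokens: string[], delayMs: number): AsyncGenerator<string> {
  for (const token of tokens) {
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    yield token;
  }
}

// Hypothetical result shape, loosely mirroring the metrics named in the post.
interface RunResult {
  model: string;
  text: string;
  ttftMs: number;       // time from request start to first chunk
  tokens: number;       // chunks received (used as a proxy for tokens)
  tokensPerSec: number; // throughput over the whole response
}

// Consume one model's stream and record timing as chunks arrive.
async function runModel(model: string, stream: AsyncIterable<string>): Promise<RunResult> {
  const start = Date.now();
  let firstTokenAt: number | null = null;
  let text = "";
  let tokens = 0;
  for await (const chunk of stream) {
    if (firstTokenAt === null) firstTokenAt = Date.now();
    text += chunk;
    tokens += 1;
  }
  // Clamp to 1 ms so a very fast stream does not divide by zero.
  const elapsedSec = Math.max(Date.now() - start, 1) / 1000;
  return {
    model,
    text,
    ttftMs: (firstTokenAt ?? Date.now()) - start,
    tokens,
    tokensPerSec: tokens / elapsedSec,
  };
}

// Start every stream before awaiting any of them, so the models run in
// parallel rather than one after another. Results keep the input order.
async function compareModels(
  streams: Record<string, AsyncIterable<string>>,
): Promise<RunResult[]> {
  return Promise.all(
    Object.entries(streams).map(([model, stream]) => runModel(model, stream)),
  );
}
```

Calling `compareModels({ "model-a": fakeStream(["Quick", "sort"], 20), "model-b": fakeStream(["Divide"], 50) })` resolves once the slowest stream finishes, so total wall time is roughly that of the slowest model rather than the sum of all of them.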

What it does:

- Streams responses from multiple models in parallel, rendered in columns
- Shows TTFT, throughput (tok/s), token counts, and cost per request
- AI judge mode: have a model evaluate and score the responses
- Export to JSON, Markdown, or self-contained HTML reports
- Run YAML-defined benchmark suites across models with aggregate scoring
- Works with Ollama for local model comparisons (zero API cost)
- Supports 40+ models via direct provider keys or Vercel AI Gateway
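To make "cost per request" and "aggregate scoring" concrete, the arithmetic can be sketched as below. This is an illustration under assumptions, not yardstiq's implementation: the type names, the per-million-token pricing fields, and the choice of a plain mean as the aggregate are all hypothetical, and any prices passed in are made-up placeholders.

```typescript
// Hypothetical shapes; the real tool may track these differently.
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

interface Price {
  inputPerMTok: number;  // USD per million input tokens (placeholder values)
  outputPerMTok: number; // USD per million output tokens (placeholder values)
}

// Cost of one request: tokens used times the per-token price on each side.
function requestCost(usage: Usage, price: Price): number {
  return (
    (usage.inputTokens * price.inputPerMTok +
      usage.outputTokens * price.outputPerMTok) / 1e6
  );
}

// One judged response: which model answered which prompt, and the judge's score.
interface JudgedRun {
  model: string;
  prompt: string;
  score: number; // scale is an assumption, e.g. 0 to 10
}

interface ModelSummary {
  model: string;
  meanScore: number;
  runs: number;
}

// Aggregate a suite by averaging each model's scores, best model first.
function aggregateScores(runs: JudgedRun[]): ModelSummary[] {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const run of runs) {
    const entry = totals.get(run.model) ?? { sum: 0, count: 0 };
    entry.sum += run.score;
    entry.count += 1;
    totals.set(run.model, entry);
  }
  return Array.from(totals.entries())
    .map(([model, { sum, count }]) => ({ model, meanScore: sum / count, runs: count }))
    .sort((a, b) => b.meanScore - a.meanScore);
}
```

A mean is the simplest aggregate; a suite with very different prompt difficulties might prefer per-prompt rankings or win rates instead.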

I built this mostly for my own workflow — picking models for different tasks, testing prompt variations, and running quick benchmarks without setting up a whole evaluation framework. It's not trying to replace serious eval platforms, just make the "which model is better for X?" question answerable in 10 seconds.

MIT licensed, written in TypeScript: https://github.com/stanleycyang/yardstiq

Happy to answer questions about the architecture or benchmarking approach.

Similar Projects

Developer Tools · ●● Solid

LLMWise – Compare, Blend, and Judge LLM Outputs from One API

Multi-model orchestration with MoA blending and circuit-breaker failover, but LiteLLM and Anthropic Batch already exist.

Slick · Solve My Problem

dm118

103mo ago

Developer Tools · ●● Solid

Real-Time AI Design Benchmark

Live multi-model comparison beats static benchmarks, but AI UI generation is crowded.

Eye Candy · Slick

kemyd

203mo ago

AI/ML · ●● Solid

Why use one AI model when you can use all of them at once!

One prompt, many models — that simple idea is executed with practical extras: independent conversation threads per model, full-text history/search, and bring‑your‑own API keys so you don't copy/paste. The landing page sells the daily‑driver vibe (lifetime one-time pricing is an attention grabber), but the concept itself is not novel; I'd want clearer UI for cost controls, API key security and model/version management before trusting it for heavy use.

Slick · Solve My Problem

lurker325

103mo ago

AI/ML · ● Mid

AptSelect – A local desktop app to test LLMs side-by-side

Prompt versioning is nice, but web tools and Cursor already do side-by-side comparison.

Slick · Ship It

dhavalt

202mo ago

AI/ML · ●● Solid

LLMxRay – an open-source observability tool for LLMs

Multilingual tokenization comparison across Arabic, Chinese, French that LangSmith ignores.

Big Brain · Niche Gem

lognebudo

103mo ago

Developer Tools · ●● Solid

NadirClaw, LLM router that cuts costs by routing prompts right

If you're burning through Claude/OpenAI credits, this is a low-friction stopgap: it classifies prompts in ~10ms and routes trivial tasks to cheaper/local models while reserving premium APIs for complex work. The agentic-task detection, reasoning-aware routing, session pinning and context-window fallback are practical touches that avoid mid-thread model bouncing and 429 failures. It isn't reinventing the space (OpenRouter and others exist), but it's focused on real-world cost tradeoffs and drop-in compatibility.

Solve My Problem · Niche Gem

amirdor

113mo ago