Yardstiq – Compare LLM outputs side-by-side in your terminal

by stanleycyang · Mar 3, 2026 · 2 points · 0 comments

AI Analysis

●● Solid · Solve My Problem · Slick · Crowd Pleaser

One-command model comparison with real-time streaming and performance metrics beats tab-switching.

Strengths
  • Eliminates genuine friction (copy-paste between tabs); streaming side-by-side layout is faster than sequential API calls.
  • Unified auth (Vercel AI Gateway or individual keys), local Ollama support, and multiple export formats (JSON, Markdown, HTML).
  • AI judge mode and YAML benchmark suites enable repeatable, scored evaluation across model cohorts.
Weaknesses
  • No validation shown for judge scoring accuracy or reproducibility; 'AI evaluates responses' is trendy but unproven.
  • Solves a developer-experience problem, not a technical one; Anthropic Workbench, OpenAI Playground, and LM Studio offer similar workflows.
Target Audience

LLM app developers, prompt engineers, model evaluators

Similar To

Anthropic Workbench · OpenAI Playground · LM Studio

Post Description

Hey HN,

I built yardstiq because I got tired of the copy-paste workflow for comparing LLM responses when developing apps. Every time I wanted to see how Claude vs GPT vs Gemini handled the same prompt, I'd open three tabs, paste the same thing, and try to eyeball the differences. It's 2026 and we have 40+ models worth considering — that doesn't scale.

yardstiq is a CLI tool that sends one prompt to multiple models simultaneously and streams the responses side-by-side in your terminal. It also tracks performance metrics (time to first token, tokens/sec, cost) and optionally runs an AI judge to score the outputs.

```
npx yardstiq "Explain quicksort in 3 sentences" -m claude-sonnet -m gpt-4o
```
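For readers wondering what "one prompt, many models, streamed in parallel" could look like in code, here is a minimal TypeScript sketch. It is not yardstiq's actual implementation: `ModelClient`, `runOne`, `compare`, and `fakeModel` are hypothetical names invented for illustration, the "models" are canned token streams rather than real provider calls, and each streamed chunk is counted as one token, which is only an approximation of what a provider would report.

```typescript
// Hypothetical types for illustration; these are not yardstiq's actual API.
type TokenStream = AsyncIterable<string>;
type ModelClient = (prompt: string) => TokenStream;

interface RunMetrics {
  model: string;
  output: string;
  tokens: number; // streamed chunks, used here as a rough stand-in for tokens
  ttftMs: number | null; // time to first token; null if nothing was emitted
  totalMs: number;
  tokensPerSec: number;
}

// Consume one model's stream, recording when the first chunk arrives.
async function runOne(
  model: string,
  client: ModelClient,
  prompt: string,
  now: () => number = Date.now,
): Promise<RunMetrics> {
  const start = now();
  let firstTokenAt: number | null = null;
  let tokens = 0;
  let output = "";
  for await (const token of client(prompt)) {
    if (firstTokenAt === null) firstTokenAt = now();
    tokens += 1;
    output += token;
  }
  const totalMs = now() - start;
  return {
    model,
    output,
    tokens,
    ttftMs: firstTokenAt === null ? null : firstTokenAt - start,
    totalMs,
    tokensPerSec: totalMs > 0 ? (tokens / totalMs) * 1000 : 0,
  };
}

// Start every model at once; Promise.all returns results in input order.
function compare(
  clients: Record<string, ModelClient>,
  prompt: string,
  now: () => number = Date.now,
): Promise<RunMetrics[]> {
  return Promise.all(
    Object.entries(clients).map(([model, client]) =>
      runOne(model, client, prompt, now),
    ),
  );
}

// A stand-in "model" that emits canned tokens with a fixed delay between them.
function fakeModel(tokens: string[], delayMs: number): ModelClient {
  return async function* () {
    for (const token of tokens) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      yield token;
    }
  };
}
```

Because all streams are started before any is awaited, total wall-clock time is bounded by the slowest model rather than the sum of all of them. Rendering the chunks into terminal columns as they arrive is a separate concern that this sketch leaves out.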

What it does:

- Streams responses from multiple models in parallel, rendered in columns
- Shows TTFT, throughput (tok/s), token counts, and cost per request
- AI judge mode: have a model evaluate and score the responses
- Export to JSON, Markdown, or self-contained HTML reports
- Run YAML-defined benchmark suites across models with aggregate scoring
- Works with Ollama for local model comparisons (zero API cost)
- Supports 40+ models via direct provider keys or Vercel AI Gateway
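As a rough illustration of what "aggregate scoring" across a benchmark suite might involve, the sketch below groups judged results by model and reports a mean score and total cost for each. The `JudgedResult` and `ModelSummary` shapes, the `summarize` function, and the 0 to 10 score scale are all assumptions made for this example; they are not taken from yardstiq's schema or source.

```typescript
// Hypothetical shape for one judged result; not yardstiq's actual schema.
interface JudgedResult {
  model: string;
  testCase: string;
  score: number; // judge score, assumed here to be on a 0-10 scale
  costUsd: number;
}

interface ModelSummary {
  model: string;
  cases: number;
  meanScore: number;
  totalCostUsd: number;
}

// Group results by model, then report mean score and total cost per model,
// sorted best-first by mean score.
function summarize(results: JudgedResult[]): ModelSummary[] {
  const byModel = new Map<
    string,
    { cases: number; scoreSum: number; costSum: number }
  >();
  for (const r of results) {
    const acc = byModel.get(r.model) ?? { cases: 0, scoreSum: 0, costSum: 0 };
    acc.cases += 1;
    acc.scoreSum += r.score;
    acc.costSum += r.costUsd;
    byModel.set(r.model, acc);
  }
  return Array.from(byModel.entries())
    .map(([model, acc]) => ({
      model,
      cases: acc.cases,
      meanScore: acc.scoreSum / acc.cases,
      totalCostUsd: acc.costSum,
    }))
    .sort((a, b) => b.meanScore - a.meanScore);
}
```

A plain mean is the simplest choice and hides variance between test cases; a real suite might also want per-case breakdowns or repeated judge runs before drawing conclusions from small score differences.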

I built this mostly for my own workflow — picking models for different tasks, testing prompt variations, and running quick benchmarks without setting up a whole evaluation framework. It's not trying to replace serious eval platforms, just make the "which model is better for X?" question answerable in 10 seconds.

MIT licensed, written in TypeScript: https://github.com/stanleycyang/yardstiq

Happy to answer questions about the architecture or benchmarking approach.
