LLMWise – Compare, Blend, and Judge LLM Outputs from One API

by dm118·Feb 20, 2026·1 point·0 comments

AI Analysis

●● Solid · Slick · Solve My Problem

Multi-model orchestration with MoA blending and circuit-breaker failover, but LiteLLM and Anthropic Batch already exist.

Strengths

•Six blend strategies (consensus, council, best_of, chain, moa, self_moa) reduce single-model failure risk.
•Real production patterns: health checks, circuit breakers, budget limits, latency tracing per-request.
•Familiar OpenAI-style API makes migration friction near-zero; BYOK (bring your own keys) included.

Weaknesses

•Crowded space: LiteLLM, Anthropic Batch, Replicate, and RunwayML already cover parts of multi-model routing.
•No evidence of cost advantage over direct API calls or superior output quality.
•Free tier (40 credits) is mostly marketing; real usage will hit the paywall quickly.

Post Description

The core idea is that no single LLM is best at everything, so we built orchestration primitives that let you combine them intelligently via a single API.

Mixture-of-Agents (MoA): Our /blend endpoint implements multi-layer MoA. You send a prompt to 2-6 models in parallel, then each model refines its answer using the other models' outputs as reference material. This runs for 1-3 configurable layers before a synthesizer model produces the final response. We also built a Self-MoA variant: a single model generates 2-8 diverse candidates using temperature variation and distinct agent prompts ("prioritize correctness", "anticipate edge cases", "be skeptical"), then synthesizes the best parts. Six blend strategies total: consensus, council, best_of, chain, moa, and self_moa.
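The MoA flow described above can be sketched roughly as follows. This is a minimal illustration, not the /blend implementation: the `Model` type, `blend_moa`, and the prompt wording are hypothetical, and it assumes a "layer" means one refinement round after the initial parallel answers.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical stand-in: a "model" is any async callable from prompt to text.
Model = Callable[[str], Awaitable[str]]


def _refine_prompt(prompt: str, own: str, others: List[str]) -> str:
    """Build the refinement prompt: the model's own answer plus peer answers."""
    refs = "\n".join(f"- {o}" for o in others)
    return (
        f"{prompt}\n\nYour previous answer:\n{own}\n\n"
        f"Other answers for reference:\n{refs}\n\nImprove your answer."
    )


async def blend_moa(
    prompt: str, models: List[Model], synthesizer: Model, layers: int = 1
) -> str:
    # Bounds taken from the post: 2-6 models, 1-3 layers.
    if not 2 <= len(models) <= 6:
        raise ValueError("expected 2-6 models")
    if not 1 <= layers <= 3:
        raise ValueError("expected 1-3 layers")

    # Initial round: every model answers the raw prompt in parallel.
    answers = list(await asyncio.gather(*(m(prompt) for m in models)))

    # Each layer: every model refines its own answer using the others' answers.
    for _ in range(layers):
        calls = [
            m(_refine_prompt(prompt, answers[i], answers[:i] + answers[i + 1:]))
            for i, m in enumerate(models)
        ]
        answers = list(await asyncio.gather(*calls))

    # Final step: a synthesizer model merges the refined candidates.
    candidates = "\n".join(f"- {a}" for a in answers)
    return await synthesizer(
        f"{prompt}\n\nCandidate answers:\n{candidates}\n\nWrite the best final answer."
    )
```

With N models and L layers this makes N * (L + 1) model calls plus one synthesizer call, which is the main cost trade-off to weigh against a single-model request.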

Circuit breakers: Every model has a health tracker with a classic closed -> open -> half-open state machine. Three consecutive failures trip the circuit for 30 seconds. When a model is down, mesh routing automatically skips it and tries the fallback chain, so no latency is wasted on providers that are having a bad day. The SSE stream emits route events so you can see exactly what happened: trying, failed, skipped(circuit_open), trying, success. OpenRouter gets its own tuned thresholds (6 consecutive 429s, 20s cooldown) because rate limits there behave differently from hard failures.
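A minimal sketch of that state machine and the skip-on-open fallback, using the thresholds from the post (3 failures, 30 s cooldown). The class and function names are hypothetical, the injectable `clock` exists only to make the behaviour testable, and the event tuples merely mimic the route events listed above rather than the actual SSE payloads.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class CircuitBreaker:
    """Closed -> open -> half-open health tracker for one model."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half_open"  # cooldown elapsed: allow a probe request
        return "open"

    def allow_request(self) -> bool:
        return self.state != "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        if self.state == "half_open":
            self._opened_at = self._clock()  # probe failed: reopen
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # trip the circuit


def route(
    chain: List[str],
    breakers: Dict[str, CircuitBreaker],
    call: Callable[[str], str],
) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Try each model in the fallback chain, skipping open circuits."""
    events: List[Tuple[str, str]] = []
    for name in chain:
        breaker = breakers[name]
        if not breaker.allow_request():
            events.append((name, "skipped(circuit_open)"))
            continue
        events.append((name, "trying"))
        try:
            result = call(name)
        except Exception:
            breaker.record_failure()
            events.append((name, "failed"))
            continue
        breaker.record_success()
        events.append((name, "success"))
        return result, events
    return None, events
```

A provider with different failure semantics, such as the OpenRouter case mentioned above, would in this sketch simply get its own instance, e.g. `CircuitBreaker(failure_threshold=6, cooldown_s=20.0)`, with only 429 responses counted as failures by the caller.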

Auto-router: Setting model: "auto" does near-zero-overhead heuristic routing: pure regex classification, no LLM call. Code goes to GPT, math/creative goes to Claude, translation goes to Gemini Flash, etc. Simple, fast, and surprisingly effective for common queries.
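To illustrate what regex-only routing might look like, here is a small sketch. The patterns, the model labels ("gpt", "claude", "gemini-flash"), the rule order, and the default are all assumptions made for the example; only the category-to-provider mapping comes from the post.

```python
import re

# Ordered rules: the first matching pattern wins. Patterns are illustrative only.
_RULES = [
    # Code-like prompts: fenced code or common programming keywords.
    (re.compile(r"```|\b(def|class|function|traceback|stack trace|compile|regex|sql)\b", re.I), "gpt"),
    # Translation requests.
    (re.compile(r"\b(translate|translation)\b|\bin (french|spanish|german|japanese)\b", re.I), "gemini-flash"),
    # Math or creative writing.
    (re.compile(r"\b(solve|integral|derivative|equation|prove|poem|story|lyrics)\b|\d\s*[-+*/^=]\s*\d", re.I), "claude"),
]

# Assumed fallback when nothing matches; the post does not say what the default is.
DEFAULT_MODEL = "gpt"


def auto_route(prompt: str) -> str:
    """Pick a model label by regex classification, with no LLM call."""
    for pattern, model in _RULES:
        if pattern.search(prompt):
            return model
    return DEFAULT_MODEL
```

Because the rules are precompiled and checked in order, classification costs a few regex scans per request; the trade-off is that ambiguous prompts are routed by whichever rule happens to match first.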

Other things that were fun to build:

- Credit settlement with margin targeting: we reserve credits upfront, then reconcile against actual provider cost after the response completes
- Per-user semantic memory via pgvector: conversations build retrievable context across sessions
- BYOK encryption (Fernet/AES-128) so you can bring your own API keys and skip our billing entirely
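The reserve-then-reconcile idea from the first item can be sketched as below. This is an assumption-heavy illustration: `Wallet`, `reserve`, `settle`, the 20% default margin, and the premise that provider cost is already expressed in credits are all hypothetical, not details from the post.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Wallet:
    balance: Decimal
    reserved: Decimal = Decimal("0")

    @property
    def available(self) -> Decimal:
        # Credits that are not currently held for in-flight requests.
        return self.balance - self.reserved


def reserve(wallet: Wallet, estimate: Decimal) -> Decimal:
    """Place a hold for the estimated cost before calling the provider."""
    if estimate > wallet.available:
        raise ValueError("insufficient credits")
    wallet.reserved += estimate
    return estimate


def settle(
    wallet: Wallet,
    hold: Decimal,
    provider_cost: Decimal,
    target_margin: Decimal = Decimal("0.20"),  # assumed margin, for illustration
) -> Decimal:
    """Release the hold and charge actual provider cost plus the target margin."""
    charge = provider_cost * (1 + target_margin)
    wallet.reserved -= hold
    wallet.balance -= charge
    return charge
```

For example, reserving 5 credits and then settling a request that actually cost 2 credits at a 25% margin charges 2.5 credits and releases the rest of the hold. A real system would also need a policy for charges that exceed the hold, which this sketch does not address.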

The whole backend is async Python (FastAPI + asyncpg + LiteLLM); the frontend is a static Next.js build served by the same FastAPI process in production. Everything ships as a single Docker image on Railway.

For the technically curious: https://llmwise.ai/llms-full.txt has the complete platform documentation in plain text, and there's also a machine-readable view at https://llmwise.ai/ai designed for AI agents to consume.