LLMWise – Compare, Blend, and Judge LLM Outputs from One API

by dm118 · Feb 20, 2026 · 1 point · 0 comments

AI Analysis

●● Solid · Slick · Solve My Problem

Multi-model orchestration with MoA blending and circuit-breaker failover, but LiteLLM and Anthropic Batch already exist.

Strengths
  • Six distinct blending modes (consensus, council, best_of, chain, moa, self_moa) reduce single-model failure risk.
  • Real production patterns: health checks, circuit breakers, budget limits, per-request latency tracing.
  • Familiar OpenAI-style API makes migration friction near-zero; BYOK (bring your own keys) included.
Weaknesses
  • Crowded space: LiteLLM, Anthropic Batch, Replicate, and Runway ML all handle multi-model routing.
  • No evidence of a cost advantage over direct API calls or of superior output quality.
  • The free tier (40 credits) is mostly marketing; real usage will hit the paywall quickly.
Target Audience

Backend engineers, AI teams reducing LLM risk, API builders needing multi-model inference.

Similar To

LiteLLM · Anthropic Batch API · Replicate

Post Description

The core idea is that no single LLM is best at everything, so we built orchestration primitives that let you combine them intelligently via a single API.

Mixture-of-Agents (MoA): Our /blend endpoint implements multi-layer MoA. You send a prompt to 2-6 models in parallel, then each model refines its answer using the other models' outputs as reference material. This runs for 1-3 configurable layers before a synthesizer model produces the final response. We also built a Self-MoA variant: a single model generates 2-8 diverse candidates using temperature variation and distinct agent prompts ("prioritize correctness", "anticipate edge cases", "be skeptical"), then synthesizes the best parts. Six blend strategies total: consensus, council, best_of, chain, moa, and self_moa.
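The layered flow described above can be sketched roughly as follows. This is a minimal illustration, not the /blend implementation: `moa_blend`, the `Model` type, and the reference-prompt wording are all hypothetical, and it assumes the initial parallel fan-out is separate from the 1-3 refinement layers.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Hypothetical abstraction: a "model" is any async callable prompt -> completion.
Model = Callable[[str], Awaitable[str]]


def _with_references(prompt: str, references: Sequence[str]) -> str:
    """Build a prompt that shows a model the other models' current answers."""
    numbered = "\n".join(f"[{i + 1}] {text}" for i, text in enumerate(references))
    return f"{prompt}\n\nReference answers from other models:\n{numbered}"


async def moa_blend(
    prompt: str,
    models: Sequence[Model],
    synthesizer: Model,
    layers: int = 1,
) -> str:
    if not 2 <= len(models) <= 6:
        raise ValueError("expected 2-6 proposer models")
    if not 1 <= layers <= 3:
        raise ValueError("expected 1-3 refinement layers")

    # Fan-out: every model answers the raw prompt in parallel.
    answers = list(await asyncio.gather(*(m(prompt) for m in models)))

    # Each layer: every model refines using the *other* models' answers.
    for _ in range(layers):
        refinements = [
            m(_with_references(prompt, answers[:i] + answers[i + 1:]))
            for i, m in enumerate(models)
        ]
        answers = list(await asyncio.gather(*refinements))

    # Assumed final step: the synthesizer sees all refined answers at once.
    return await synthesizer(_with_references(prompt, answers))
```

With n models and k layers this sketch makes n(k + 1) proposer calls plus one synthesizer call, which is worth keeping in mind when weighing blend quality against cost.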

Circuit breakers: Every model has a health tracker with a classic closed → open → half-open state machine. Three consecutive failures trip the circuit for 30 seconds. When a model is down, mesh routing automatically skips it and tries the fallback chain, so no latency is wasted on providers that are having a bad day. The SSE stream emits route events so you can see exactly what happened: trying, failed, skipped(circuit_open), trying, success. OpenRouter gets its own tuned thresholds (6 consecutive 429s, 20s cooldown) because rate limits there behave differently from hard failures.
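A minimal sketch of that state machine and the fallback walk, under stated assumptions: the `CircuitBreaker` class, the `route` helper, and the injectable `clock` are hypothetical names for illustration. The defaults (3 failures, 30 seconds) and the event labels come from the description above. Re-opening on a failed half-open probe is an assumption.

```python
import time
from typing import Callable, Dict, List, Optional, Sequence, Tuple


class CircuitBreaker:
    """Closed -> open -> half-open breaker keyed on consecutive failures."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half_open"  # cooldown elapsed: allow a probe request
        return "open"

    def allow_request(self) -> bool:
        return self.state != "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        if self.state == "half_open":
            self._opened_at = self._clock()  # failed probe: restart cooldown
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def route(
    chain: Sequence[str],
    breakers: Dict[str, CircuitBreaker],
    call: Callable[[str], str],
) -> Tuple[Optional[str], List[Tuple[str, str]]]:
    """Walk the fallback chain, skipping open circuits and logging events."""
    events: List[Tuple[str, str]] = []
    for name in chain:
        breaker = breakers[name]
        if not breaker.allow_request():
            events.append((name, "skipped(circuit_open)"))
            continue
        events.append((name, "trying"))
        try:
            result = call(name)
        except Exception:
            breaker.record_failure()
            events.append((name, "failed"))
            continue
        breaker.record_success()
        events.append((name, "success"))
        return result, events
    return None, events
```

A per-provider override such as the OpenRouter thresholds would, in this sketch, simply be `CircuitBreaker(failure_threshold=6, cooldown_s=20.0)`, with the caller recording a failure only for 429 responses.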

Auto-router: model: "auto" routes heuristically using pure regex classification, with no LLM call and therefore negligible added latency. Code goes to GPT, math/creative goes to Claude, translation goes to Gemini Flash, etc. Simple, fast, and surprisingly effective for common queries.
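As a rough sketch of regex-only routing: the patterns, the model labels, the rule order, and the `DEFAULT_MODEL` fallback below are all illustrative assumptions, not the actual LLMWise rules. Only the category-to-provider mapping (code to GPT, math/creative to Claude, translation to Gemini Flash) comes from the description.

```python
import re
from typing import List, Pattern, Tuple

# Hypothetical rules, checked in order; the first match wins.
ROUTES: List[Tuple[Pattern[str], str]] = [
    # Code: fenced blocks or common programming vocabulary.
    (re.compile(r"```|\b(def|class|function|import|traceback|bug|compile)\b", re.I), "gpt"),
    # Translation requests.
    (re.compile(r"\b(translate|translation)\b", re.I), "gemini-flash"),
    # Math and creative writing.
    (re.compile(r"\b(solve|integral|derivative|equation|prove|poem|story)\b", re.I), "claude"),
]

# Assumed fallback when no rule matches; the source does not name one.
DEFAULT_MODEL = "gpt"


def auto_route(prompt: str) -> str:
    """Pick a model label from the prompt text using regex rules only."""
    for pattern, model in ROUTES:
        if pattern.search(prompt):
            return model
    return DEFAULT_MODEL
```

Rule order matters in a scheme like this: a prompt that asks for a poem about a compiler bug would match the code rule first, so ambiguous categories need either ordering decisions or more specific patterns.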

Other things that were fun to build:

- Credit settlement with margin targeting: we reserve credits upfront, then reconcile against actual provider cost after the response completes
- Per-user semantic memory via pgvector: conversations build retrievable context across sessions
- BYOK encryption (Fernet/AES-128) so you can bring your own API keys and skip our billing entirely
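The reserve-then-reconcile idea in the first item can be sketched as below. Everything here is a hypothetical illustration: the `Wallet` type, the 30% margin, the $0.01 credit value, and the choice to cap the charge at the held amount are assumptions, not details from the post.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_CEILING


@dataclass
class Wallet:
    balance: int       # spendable credits
    reserved: int = 0  # credits held for in-flight requests


def reserve(wallet: Wallet, estimate: int) -> int:
    """Hold an estimated number of credits before calling the provider."""
    if estimate > wallet.balance:
        raise ValueError("insufficient credits")
    wallet.balance -= estimate
    wallet.reserved += estimate
    return estimate


def settle(
    wallet: Wallet,
    held: int,
    provider_cost_usd: Decimal,
    margin: Decimal = Decimal("0.30"),          # assumed target margin
    usd_per_credit: Decimal = Decimal("0.01"),  # assumed credit value
) -> int:
    """Release the hold and charge the actual cost plus the target margin."""
    # Price such that `margin` is the fraction of the price kept as margin.
    price_usd = provider_cost_usd / (1 - margin)
    credits = (price_usd / usd_per_credit).to_integral_value(rounding=ROUND_CEILING)
    # Assumption: never charge more than was reserved upfront.
    charge = min(int(credits), held)
    wallet.reserved -= held
    wallet.balance += held - charge  # refund the unused part of the hold
    return charge
```

For example, a 10-credit hold settled against a provider cost of $0.035 charges 5 credits under these assumed parameters and returns the other 5 to the balance.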

The whole backend is async Python (FastAPI + asyncpg + LiteLLM); the frontend is a static Next.js build served by the same FastAPI process in production. Everything ships as a single Docker image on Railway.

For the technically curious: https://llmwise.ai/llms-full.txt has the complete platform documentation in plain text, and there's also a machine-readable view at https://llmwise.ai/ai designed for AI agents to consume.
