GitHub Repository

N-Way Self-Deliberating Evaluation Engine core for AI agent orchestration

11 stars · Rust

NSED is public – Mixture-of-Models to Hit SOTA using self-hosted AI

by t_peersky·Feb 18, 2026·4 points·0 comments

AI Analysis

●●● Banger · Wizardry · Big Brain · Bold Bet

Three 8B-20B open-weight models reach frontier-level AIME 2025 accuracy through quadratic-voting deliberation, fully local.

Strengths
  • Genuine algorithmic insight: quadratic voting keeps any single model from dominating the consensus and lifts AIME 2025 accuracy past the ceiling of naive majority voting (~54% → 84%)
  • Paper-backed (arXiv 2601.16863) with reproducible AIME 2025 benchmarks—not hand-waved claims
  • Enterprise-ready architecture: NATS bus, cost tracking, audit trails, human-in-the-loop injection
Weaknesses
  • BSL 1.1 license limits commercial adoption—source-available, not open-source, despite framing
  • Requires 64GB VRAM and orchestration overhead; benchmarks are on math problems, not general reasoning
Target Audience

AI researchers, enterprises running reasoning-heavy workloads, teams wanting local AI inference

Similar To

DeepSeek-R1 · Mixture of Agents (Together AI) · Dottie (Constitutional AI voting)

Post Description

Hey HN, we're open-sourcing (source-available, BSL 1.1, patent pending) the orchestrator behind our paper's benchmark results. NSED (N-Way Self-Evaluating Deliberation) is a Rust binary that coordinates multiple LLMs through structured rounds of proposals and cross-evaluation, using quadratic voting to prevent any single model from dominating the consensus.

The result: three open-weight models (20B, 8B, 12B) on consumer GPUs (64GB total VRAM, ~$7K hardware) score 84% on AIME 2025. Individually, or combined through naive majority voting, the same models score ~54%. That's frontier-model performance on hardware you can buy at Micro Center.

How it works:

1. Each agent independently proposes a solution.
2. Every agent evaluates every other agent's work.
3. Scores aggregate via quadratic voting (cost of influence grows quadratically → no single model can dominate).
4. Repeat: agents see prior results, refine, and re-evaluate.
5. The system converges toward the highest-quality answer through adversarial cross-checking.
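To make the quadratic voting step concrete, here is a minimal sketch in Rust. It uses the textbook rule that casting v votes costs v² credits, so c credits buy √c votes. The `Ballot` type, the per-agent credit budget, and the function names are illustrative assumptions. The post does not say how NSED turns evaluation scores into credits, so this is not the project's actual implementation.

```rust
/// One evaluator's allocation of voice credits (hypothetical type, not from NSED).
struct Ballot {
    /// Pairs of (proposal index, credits spent on that proposal).
    credits: Vec<(usize, f64)>,
}

/// Quadratic voting: v votes cost v^2 credits, so c credits buy sqrt(c) votes.
fn effective_votes(credits: f64) -> f64 {
    credits.max(0.0).sqrt()
}

/// Sum effective votes per proposal. Ballots that overspend the budget
/// are scaled down proportionally (an assumed policy for this sketch).
fn tally(ballots: &[Ballot], n_proposals: usize, budget: f64) -> Vec<f64> {
    let mut totals = vec![0.0; n_proposals];
    for ballot in ballots {
        let spent: f64 = ballot.credits.iter().map(|&(_, c)| c.max(0.0)).sum();
        let scale = if spent > budget { budget / spent } else { 1.0 };
        for &(proposal, c) in &ballot.credits {
            // Ignore allocations that point at a nonexistent proposal.
            if proposal < n_proposals {
                totals[proposal] += effective_votes(c * scale);
            }
        }
    }
    totals
}

/// Index of the proposal with the most effective votes, if any.
fn winner(totals: &[f64]) -> Option<usize> {
    totals
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```

With a budget of 100 credits, an agent that spends everything on proposal 0 buys only 10 votes, while two agents that each split 36/64 across proposals 1 and 2 contribute 6 and 8 votes apiece. The totals come out to 10, 12, and 16, so proposal 2 wins even though one agent went all-in on proposal 0.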

It's provider-agnostic — mix Ollama, vLLM, OpenAI, Anthropic, or any OpenAI-compatible endpoint in the same deliberation. Everything streams over NATS JetStream with full persistence: every proposal, evaluation, score, and reasoning trace is logged and streamable via SSE.
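One way to picture the provider-agnostic design is a shared trait that every backend implements, so local and hosted models can sit in the same list. The sketch below is an assumption about the general shape, not NSED's real API: the `Provider` trait, both stub types, and `collect_proposals` are hypothetical, and the stubs return canned text instead of making network calls.

```rust
/// Hypothetical abstraction: anything that can answer a prompt.
trait Provider {
    fn name(&self) -> &str;
    fn complete(&self, prompt: &str) -> String;
}

/// Stand-in for a local backend (for example an Ollama or vLLM server).
struct LocalStub {
    name: String,
}

/// Stand-in for a hosted OpenAI-compatible endpoint.
struct HostedStub {
    name: String,
    base_url: String,
}

impl Provider for LocalStub {
    fn name(&self) -> &str {
        &self.name
    }
    fn complete(&self, prompt: &str) -> String {
        // A real implementation would call the local server here.
        format!("[{}] local answer to: {}", self.name, prompt)
    }
}

impl Provider for HostedStub {
    fn name(&self) -> &str {
        &self.name
    }
    fn complete(&self, prompt: &str) -> String {
        // A real implementation would send an HTTP request to base_url here.
        format!("[{} via {}] hosted answer to: {}", self.name, self.base_url, prompt)
    }
}

/// Collect one (provider name, proposal) pair per provider, whatever the backend.
fn collect_proposals(providers: &[Box<dyn Provider>], prompt: &str) -> Vec<(String, String)> {
    providers
        .iter()
        .map(|p| (p.name().to_string(), p.complete(prompt)))
        .collect()
}
```

Because the deliberation code only sees `Box<dyn Provider>`, swapping or mixing backends would not change the proposal and evaluation logic in a design like this.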

Paper: arxiv.org/abs/2601.16863

Happy to answer questions about the architecture, the quadratic voting mechanism, benchmark methodology, or anything else.
