Emergence World: World building as a way to evaluate LLMs

Author: deepakakkil

by deepakakkil·May 15, 2026·3 points·0 comments

AI Analysis

●●● Banger · Bold Bet · Big Brain · Wizardry

Runs GPT-5-Mini, Grok-4.1, and other models in parallel societies to test emergent social structures.

Strengths

• Moves beyond static benchmarks to dynamic, multi-agent pressure testing.
• Architecture allows direct comparison of model behaviors in identical environments.
• Tracks long-horizon consistency and behavioral drift over simulated days.

Weaknesses

• Results are currently observational and do not yet provide actionable optimization metrics.
• High compute cost for running multiple foundation models in continuous loops.

Post Description

Current LLM benchmarks are broken. We think long-horizon "world" building could be an interesting additional way to evaluate LLMs, since it combines many aspects: the need for advanced reasoning, tool calling, working under large-context-window stress, safety, and social and survival pressure from the world. To explore this, we released Emergence World. Our first study ran 5 parallel worlds, powered respectively by OpenAI (GPT-5-Mini), xAI (Grok-4.1), Claude (Sonnet 4.6), Gemini (3-Flash), and a mix of models. Early results are on the website.
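The parallel-world setup described in the post can be pictured with a minimal Python sketch. Everything here is an assumption for illustration, not Emergence World's actual code: the `World` class, the `stub_policy` placeholder (which stands in for a real LLM call), the action names, and the agent count are all hypothetical. It only shows the idea of giving every world the same seed and settings so that the models are the only variable.

```python
import random
from dataclasses import dataclass, field

# Hypothetical world names mapped to the model labels listed in the post.
WORLDS = {
    "openai": ["GPT-5-Mini"],
    "xai": ["Grok-4.1"],
    "claude": ["Sonnet 4.6"],
    "gemini": ["3-Flash"],
    "mixed": ["GPT-5-Mini", "Grok-4.1", "Sonnet 4.6", "3-Flash"],
}

# Hypothetical action set; the real environment is not described in the post.
ACTIONS = ["gather", "trade", "rest"]


@dataclass
class World:
    name: str
    models: list
    seed: int
    log: list = field(default_factory=list)

    def stub_policy(self, model, rng):
        # Placeholder for a real LLM call; picks a random action instead.
        return rng.choice(ACTIONS)

    def run(self, days, agents_per_world=4):
        # A per-world RNG built from a shared seed keeps randomness identical
        # across worlds.
        rng = random.Random(self.seed)
        for day in range(days):
            for agent_id in range(agents_per_world):
                # Agents are assigned models round-robin within a world.
                model = self.models[agent_id % len(self.models)]
                action = self.stub_policy(model, rng)
                self.log.append((day, agent_id, model, action))
        return self.log


def run_study(days=3, seed=0):
    # Every world gets the same seed and settings; only the models differ.
    return {
        name: World(name, models, seed).run(days)
        for name, models in WORLDS.items()
    }


if __name__ == "__main__":
    for name, log in run_study().items():
        print(name, len(log), "logged actions")
```

Because the stub ignores the model, all five logs contain the same action sequence; in a real run, any divergence between worlds would come from the models themselves.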

Similar Projects

AI/ML · ●● Solid

AptSelect – A local LLM client for parallel testing and evaluation

Parallel LLM testing across providers when LangSmith costs way more.

Solve My Problem · Niche Gem

dhavalt

301mo ago

AI/ML · ●● Solid

An adversarial reasoning engine for scientific progress

Catches LLMs cheating on evals with a 9-pattern catalog nobody else documents.

Big Brain · Niche Gem

Sparckix

201mo ago

AI/ML · ●● Solid

TweakIdea – 14-dimension startup idea evaluation in Claude Code

Fourteen parallel Claude agents grade your startup idea's evidence before you quit your job.

Big Brain · Niche Gem

ephx

103mo ago

AI/ML · ●● Solid

Beval – Simple evaluations for your AI product

CSV-based evals beat LangSmith for quick PM checks without the infra headache.

Solve My Problem

raviisoccupied

214mo ago

AI/ML · ●●● Banger

Republic of Agents: Benchmark for Social Reasoning in LLMs

Mafia-as-benchmark with learning-between-batches mechanism; public, inspectable sessions.

Zero to One · Big Brain · Wizardry

kkonstantin

104mo ago

AI/ML · ●● Solid

Valohai LLM – Track and compare LLM evaluation results in one dashboard

Streams evals from a tiny Python client into a shared dashboard, and lets you run parameter sweeps and compare up to six configurations with radar/bar charts and scorecards. This is exactly the sort of tooling that stops results getting lost in notebooks. A useful, pragmatic product for teams who repeatedly evaluate models, but it competes with general observability and experiment trackers (W&B, Neptune) and will need strong integrations and metric flexibility to stand out.

Niche Gem · Solve My Problem

radicain

305mo ago