Emergence World: World building as a way to evaluate LLMs
Runs GPT-5 and Grok in parallel societies to test emergent social structures.
Zero-Trust Adversarial Reasoning Engine: an autoresearch-inspired kernel for creating and validating new science.
Catches LLMs cheating on evals with a 9-pattern catalog nobody else documents.
AI researchers, ML engineers building eval frameworks
LangSmith · Braintrust · Arize Phoenix
Interesting conceptual take, but the repo has 2 commits and zero working code.
Side-swapped debate matchups expose model weaknesses standard benchmarks miss.
62k-puzzle benchmark reveals reasoning depth, cost variance, and stark US vs. China model gaps.
Solid on-ramp for paper skimming, but Claude with a saved prompt does the same.
Append-only lineage prevents LLM outputs from collapsing structure, but it is unclear whether it ships or works.