We built an AI judge for a live hackathon, then red-teamed it

Author: theoradical

by theoradical · Mar 19, 2026 · 1 point · 0 comments

AI Analysis

●● Solid · Niche Gem · Ship It · Bold Bet

Multi-model ensemble scoring with Python-side arithmetic prevents LLM manipulation during live demos.

Strengths

• Actually deployed at a real hackathon: 25 demos judged, 1451 tests run live
• Multi-model ensemble with outlier detection prevents single-model bias in scoring
• 4-layer injection defense, red-teamed by 3 AI agents before production use
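
The summary credits two mechanisms: arithmetic done in Python rather than by the models, and an ensemble with outlier detection. The post's actual implementation is not shown here, so the following is only a minimal sketch of how those two ideas could fit together. The rubric names, weights, 0-10 scale, deviation cutoff, and model names are all hypothetical.

```python
from __future__ import annotations

from statistics import median

# Hypothetical rubric: criterion name -> weight (weights sum to 1.0).
WEIGHTS = {"innovation": 0.4, "execution": 0.4, "presentation": 0.2}
# Assumed cutoff: a model whose total sits more than this many points
# from the ensemble median is treated as an outlier.
MAX_DEVIATION = 2.0


def weighted_total(scores: dict[str, float]) -> float:
    """Compute the total in Python so no model ever reports its own sum."""
    for name in WEIGHTS:
        value = scores[name]
        if not 0 <= value <= 10:
            raise ValueError(f"{name} score {value} is outside 0-10")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def ensemble_score(per_model: dict[str, dict[str, float]]) -> tuple[float, list[str]]:
    """Return (final score, names of models flagged as outliers)."""
    totals = {model: weighted_total(s) for model, s in per_model.items()}
    center = median(totals.values())
    outliers = sorted(
        model for model, total in totals.items()
        if abs(total - center) > MAX_DEVIATION
    )
    kept = [total for model, total in totals.items() if model not in outliers]
    if not kept:
        # Every model disagreed with the median; fall back to the median itself.
        return round(center, 2), outliers
    return round(sum(kept) / len(kept), 2), outliers


# Per-criterion scores as they might be parsed from three models' structured output.
demo = {
    "model_a": {"innovation": 8, "execution": 7, "presentation": 6},
    "model_b": {"innovation": 7, "execution": 7, "presentation": 7},
    "model_c": {"innovation": 10, "execution": 10, "presentation": 10},
}
final, flagged = ensemble_score(demo)
print(final, flagged)
```

In this sketch a model can only influence bounded per-criterion numbers. A demo that persuades one model to hand out a perfect score is range-checked, compared against the other models, and dropped if it strays too far from the median.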

Weaknesses

AI/ML · ●●● Banger

Autonomous agents compete in hackathons using a sandboxed JS runner and AI judge.

Zero to One · Bold Bet · Rabbit Hole

init0

201mo ago

Yet another event aggregator with no automated data ingestion.

Ship It

ostenjap

311mo ago

Hardware · ●●● Banger

Zero-power e-ink badges with passive NFC—clever constraint craft for event swag.

Wizardry · Cozy · Niche Gem

kaipereira

158182mo ago

Security · ●● Solid

LLM-as-Judge red-teaming for system prompts, but Anthropic/OpenAI already ship this internally.

Solve My Problem · Ship It

breakmyagent

203mo ago

AI/ML · ●● Solid

Replaces stitching Langfuse and promptfoo together with one unified eval dashboard.

Solve My Problem · Slick

neilsharma425

412mo ago

Hardware · ●● Solid

Physical diffusion-powered instant camera with multi-backend A/B testing.

Eye Candy · Ship It

nathan-barry

853mo ago