I built a human rights evaluator for HN (content vs. site behavior)

by 9wzYQbTYsAIc · Mar 4, 2026 · 3 points · 2 comments

AI Analysis

●● Solid · Niche Gem · Solve My Problem · Eye Candy

Maps 806 HN stories to UDHR articles: a novel corpus, but one with limited generalizability beyond HN.

Strengths
  • Rare dataset: 806 HN stories evaluated against the UDHR give empirical grounding to the question of whether platforms match their stated values
  • Interactive heatmap plus a detailed 'says ≠ does' methodology (for example, a site that tracks you while publishing privacy articles) catches real hypocrisy; narrative cohesion is strong
Weaknesses
  • Scope: HN only; unclear how methodology scales to Reddit, Twitter, other platforms
  • Built in 8 days with Claude Code; the code and prompts are public, but the scores are LLM judgments, so users still have to trace the evidence chain to verify the scoring logic
Target Audience

Human rights researchers, tech policy advocates, HN community analysts, journalists investigating platform ethics

Similar To

Media Bias Fact Check · NewsGuard · Global Disinformation Index

Post Description

My health challenges limit how much I can work. I've come to think of Claude Code as an accommodation engine — not in the medical-paperwork sense, but in the literal one: it gives me the capacity to finish things that a normal work environment doesn't. Observatory was built in eight days because that kind of collaboration became possible for me. (I even used Claude Code to write this post — but am only posting what resonates with me.) Two companion posts: on the recursive methodology (https://blog.unratified.org/2026-03-03-recursive-methodology...) and what 806 evaluated stories reveal (https://blog.unratified.org/2026-03-03-what-806-stories-reve...).

I built Observatory to automatically evaluate Hacker News front-page stories against all 31 provisions of the UN Universal Declaration of Human Rights (its 30 articles plus the Preamble). I started with HN because its human-curated front page is one of the few feeds where a story's presence signals something about quality, not just virality. It runs every minute: https://observatory.unratified.org. Claude Haiku 4.5 handles full evaluations; Llama 4 Scout and Llama 3.3 70B on Workers AI run a lighter free-tier pass.

The observation that shaped the design: rights violations rarely announce themselves. An article about a company's "privacy-first approach" might appear on a site running twelve trackers. The interesting signal isn't whether an article mentions privacy — it's whether the site's infrastructure matches its words.

Each evaluation runs two parallel channels. The editorial channel scores what the content says about rights: which provisions it touches, direction, evidence strength. The structural channel scores what the site infrastructure does: tracking, paywalls, accessibility, authorship disclosure, funding transparency. The divergence — SETL (Structural-Editorial Tension Level) — is often the most revealing number. "Says one thing, does another," quantified.

Every evaluation separates observable facts from interpretive conclusions (the Fair Witness layer, same concept as fairwitness.bot — https://news.ycombinator.com/item?id=44030394). You get a facts-to-inferences ratio and can read exactly what evidence the model cited. If a score looks wrong, follow the chain and tell me where the inference fails.
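As a rough illustration of the facts-to-inferences ratio, here is a minimal sketch. It assumes each evidence item carries a label of either "fact" or "inference"; that labeling scheme and the function name `facts_to_inferences` are hypothetical, not taken from Observatory's code.

```python
from collections import Counter
from typing import Iterable, Optional


def facts_to_inferences(kinds: Iterable[str]) -> Optional[float]:
    """Ratio of 'fact' items to 'inference' items; None if no inferences."""
    counts = Counter(kinds)
    if counts["inference"] == 0:
        return None  # ratio is undefined without any inferences
    return counts["fact"] / counts["inference"]


# Made-up labels for the evidence items of one evaluation.
example = ["fact", "fact", "inference", "fact", "inference"]
ratio = facts_to_inferences(example)  # 3 facts over 2 inferences
```

A higher ratio means more of the evaluation rests on observable evidence rather than on interpretation.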

Across 805 evaluated stories: only 65% identify their author, so roughly one in three HN stories has no named author. 18% disclose conflicts of interest. 44% assume expert knowledge (a structural note on Article 26). Tech coverage runs nearly 10× more retrospective than prospective: past harm is documented extensively, while prevention is rarely discussed.

One story illustrates SETL best: "Half of Americans now believe that news organizations deliberately mislead them" (fortune.com, 652 HN points). Editorial: +0.30. Structural: −0.63 (paywall, tracking, no funding disclosure). SETL: 0.84. A story about why people don't trust media, from an outlet whose own infrastructure demonstrates the pattern.
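The two channels in that example fit in a small record. This is a minimal sketch, not Observatory's code: the `ChannelScores` name and the `raw_gap` helper are hypothetical, and the post does not give the SETL formula. The plain gap between the two quoted scores is 0.93 rather than the reported 0.84, so SETL is presumably computed differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelScores:
    """Hypothetical record for one story's two channel scores."""
    editorial: float   # what the content says about rights
    structural: float  # what the site's infrastructure does

    def raw_gap(self) -> float:
        # Plain distance between the channels. An illustrative stand-in,
        # NOT the SETL formula, which the post does not specify.
        return abs(self.editorial - self.structural)


# Scores quoted in the post for the fortune.com story.
fortune = ChannelScores(editorial=0.30, structural=-0.63)
gap = round(fortune.raw_gap(), 2)
```

Keeping the channels as separate fields, instead of storing only a combined number, lets a reader see which side drives the tension.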

The structural channel for free Llama models is noisy — 86% of scores cluster on two integers. The direction I'm exploring: TQ (Transparency Quotient) — binary, countable indicators that don't need LLM interpretation (author named? sources cited? funding disclosed?). Code is open source: https://github.com/safety-quotient-lab/observatory — the .claude/ directory has the cognitive architecture behind the build.
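Both ideas in this paragraph can be sketched in a few lines. The assumptions are mine: `top_two_share` is one way to measure how heavily scores cluster on two values, and `transparency_quotient` treats TQ as the fraction of binary indicators that are present. The post only says TQ uses binary, countable indicators, so the fraction form, the function names, and the sample data are all hypothetical.

```python
from collections import Counter
from typing import Mapping, Sequence


def top_two_share(scores: Sequence[int]) -> float:
    """Fraction of scores that fall on the two most common values."""
    if not scores:
        raise ValueError("need at least one score")
    top = Counter(scores).most_common(2)
    return sum(count for _, count in top) / len(scores)


def transparency_quotient(indicators: Mapping[str, bool]) -> float:
    """Share of binary indicators that are present (0.0 to 1.0)."""
    if not indicators:
        raise ValueError("need at least one indicator")
    return sum(indicators.values()) / len(indicators)


# Made-up data for illustration only.
scores = [0, 0, 1, 1, 1, 0, 2, -1, 0, 1]
share = top_two_share(scores)
tq = transparency_quotient(
    {"author_named": True, "sources_cited": True, "funding_disclosed": False}
)
```

Because each indicator is a yes/no check, two people (or two scripts) looking at the same page should arrive at the same TQ, which is the property the noisy LLM scores lack.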

Find a story whose score looks wrong, open the detail page, follow the evidence chain. The most useful feedback: where the chain reaches a defensible conclusion from defensible evidence and still gets the normative call wrong. That's the failure mode I haven't solved. My background is math and psychology (undergrad), a decade in software — enough to build this, not enough to be confident the methodology is sound. Expertise in psychometrics, NLP, or human rights scholarship especially welcome. Methodology, prompts, and a 15-story calibration set are on the About page.

Thanks!

Similar Projects

AI/ML · Mid

Pipevals – a visual pipeline builder for evaluation-driven AI

Early learning project in a crowded eval space dominated by LangSmith and Arize.
