I fixed my AI goose tutor to stop punishing understanding

Author: zapseo

by zapseo · May 20, 2026 · 3 points · 2 comments


AI Analysis

●● Solid · Cozy · Solve My Problem · Big Brain

Understanding meter that drops when you hand-wave instead of explaining mechanisms.

Strengths

• Live 'understanding bar' visualizes confidence based on explanation depth, not keyword matching.
• Strictness modes let you toggle from 'breezy' to 'razor' depending on how hard you want to be pushed.
• Session memory tracks which concepts you dodged and resumes questioning from there next time.

Weaknesses

• No export or flashcard generation: sessions vanish after the chat ends unless you manually save notes.
• Premium gating on stricter modes limits utility for serious learners who need the toughest feedback.

Post Description

a few weeks ago, I built professor goose, a socratic ai tutor based on the feynman idea that if you can’t explain it, you don’t understand it. here’s how it works: you pick a topic and rubber duck at a 3d goose, and instead of answering, it asks follow-up questions until it understands.

that raises the question: how does the goose know it has understood?

that’s when I thought of an understanding bar: always visible to the user, showing how much the goose understands you, from 0 to 100%.

the original logic powering the understanding bar went something like this: every turn, I’d send the convo to an llm and ask it to return a number from 0 to 100, with a rubric of brackets to make the output less volatile. 0-10 meant no real understanding, 11-20 meant named but empty, 21-35 meant partial understanding, and so on, up to 93-100 for the goose understanding your topic exceptionally well. this approach worked. mostly. until I started looking at what came back once real users tested the goose.

two testers were explaining the basics of how a cpu works. the first used a textbook-style definition (fetch, decode, execute, etc.) and got a final understanding of 87% after a couple of turns. the second used a real-world example of a chef, linking it to the concepts of a cpu. same level of understanding, expressed differently. the second tester got a score of 36. I’d built the opposite of what I wanted: a tutor that rewards parroting.

digging into the data to find the source of the variance, I noticed that if I put the same paragraph in verbatim five times, the scores that came back varied: 51, 66, 51, 70, 51. the brackets kind of stabilized the results, but the score was unexplainable. why 66 and not 70? nothing in the system could tell me, the llm just picked.
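to put a number on that spread, here’s a quick sketch that summarizes the five scores quoted above. only the scores come from the post; the variable names and the choice of statistics (mean, range, population standard deviation) are just one illustrative way to look at it.

```javascript
// Scores returned for the same verbatim paragraph, taken from the post.
const scores = [51, 66, 51, 70, 51];

// Arithmetic mean of the five runs.
const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;

// Gap between the lowest and highest run.
const spread = Math.max(...scores) - Math.min(...scores);

// Population standard deviation: a rough measure of run-to-run noise.
const variance =
  scores.reduce((sum, s) => sum + (s - mean) ** 2, 0) / scores.length;
const stdDev = Math.sqrt(variance);

console.log({ mean, spread, stdDev: Number(stdDev.toFixed(1)) });
```

a 19-point gap between identical inputs is wider than most of the rubric brackets, which is why the same explanation could land in different brackets from run to run.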

the fix was to stop asking the model to be the math, and build a new system. now every session with a meaningful topic gets a ‘flight plan’: a separate llm call generates 3-4 essential subconcepts (waypoints) a real explanation must cover. e.g. for photosynthesis: what it uses, what it produces, why plants need it. each turn, the goose’s evaluator returns discrete depth updates per waypoint (0-3: not addressed, named, stated, explained in own words), plus any misconceptions it spotted. JavaScript handles the rest: it makes sure depth only moves up (like a ratchet), computes weighted coverage, and controls the gate to finish (wrap) a session and the flow to repair a misconception.
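to make the ratchet, weighted coverage, and wrap gate concrete, here’s a minimal sketch in javascript. it is not the actual professor goose code: the waypoint ids, the weights, the 80% wrap threshold, and every function name are assumptions made up for illustration. only the 0-3 depth scale and the photosynthesis waypoints come from the post.

```javascript
// Hypothetical flight plan for photosynthesis. The three waypoints are from
// the post; the ids and weights are invented for this sketch.
function makePlan() {
  return [
    { id: "inputs", label: "what it uses", weight: 1, depth: 0 },
    { id: "outputs", label: "what it produces", weight: 1, depth: 0 },
    { id: "purpose", label: "why plants need it", weight: 2, depth: 0 },
  ];
}

// Ratchet: apply the evaluator's per-waypoint depths (0 = not addressed,
// 1 = named, 2 = stated, 3 = explained in own words) without ever lowering one.
function applyUpdates(plan, updates) {
  return plan.map((wp) => {
    const proposed = updates[wp.id];
    if (!Number.isInteger(proposed)) return wp; // no update for this waypoint
    const clamped = Math.min(Math.max(proposed, 0), 3);
    return { ...wp, depth: Math.max(wp.depth, clamped) };
  });
}

// Weighted coverage as a 0-100 percentage of the maximum possible depth.
function coverage(plan) {
  const total = plan.reduce((sum, wp) => sum + wp.weight * 3, 0);
  const earned = plan.reduce((sum, wp) => sum + wp.weight * wp.depth, 0);
  return total === 0 ? 0 : Math.round((earned / total) * 100);
}

// Wrap gate (assumed rule): no open misconceptions and enough coverage.
function canWrap(plan, openMisconceptions, threshold = 80) {
  return openMisconceptions.length === 0 && coverage(plan) >= threshold;
}

let plan = makePlan();
plan = applyUpdates(plan, { inputs: 2, purpose: 3 });
plan = applyUpdates(plan, { inputs: 1, outputs: 3 }); // inputs stays at 2
console.log(coverage(plan), canWrap(plan, [])); // 92 true
```

the point of this shape is that the same evaluator output always produces the same bar value, and every point on the bar traces back to a specific waypoint and depth.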

what if the user introduces a subtopic the plan didn’t anticipate?

in that case, the system decides whether to amend the plan mid-session, with a backfill evaluation to credit prior turns. I also added 5 levels of intelligence to the goose (breezy to razor sharp). at every level the model judges objective depth, then code decides what’s enough. the same chef analogy now scores 87, because the evaluation prompt explicitly tells the llm that the waypoint’s ideal answer is just one valid framing, not the only one.
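here’s a rough sketch of both ideas: code deciding what’s enough at each level, and amending the plan with a backfill pass. everything specific in it is an assumption: the threshold tables, the function names, the `keyword` field, and the extra waypoint are invented, and `fakeEvaluate` is a crude stand-in for what would really be an (async) llm call returning a 0-3 depth. the post only names the two ends of the scale, so levels are numbered 1 (breezy) to 5 (razor sharp).

```javascript
// Invented thresholds per level: the minimum depth a waypoint needs to count,
// and the share of waypoints that must reach it.
const REQUIRED_DEPTH = { 1: 1, 2: 2, 3: 2, 4: 3, 5: 3 };
const REQUIRED_SHARE = { 1: 0.5, 2: 0.6, 3: 0.8, 4: 0.8, 5: 1.0 };

// The evaluator reports the same objective depths at every level;
// only this function's verdict changes with strictness.
function isEnough(plan, level) {
  const passed = plan.filter((wp) => wp.depth >= REQUIRED_DEPTH[level]).length;
  return plan.length > 0 && passed / plan.length >= REQUIRED_SHARE[level];
}

// Amend the plan mid-session, then backfill: re-check earlier student turns
// against the new waypoint and keep the best depth any of them earned.
function amendPlan(plan, newWaypoint, priorStudentTurns, evaluateTurn) {
  if (plan.some((wp) => wp.id === newWaypoint.id)) return plan; // already planned
  const depth = priorStudentTurns.reduce(
    (best, turn) => Math.max(best, evaluateTurn(turn, newWaypoint)),
    0
  );
  return [...plan, { ...newWaypoint, depth }];
}

// Stub evaluator so the sketch runs without a model. A real evaluator would
// judge the explanation itself, not look for a keyword.
const fakeEvaluate = (turn, wp) => (turn.includes(wp.keyword) ? 2 : 0);

let sessionPlan = [
  { id: "inputs", depth: 3 },
  { id: "outputs", depth: 2 },
];
sessionPlan = amendPlan(
  sessionPlan,
  { id: "chlorophyll", keyword: "chlorophyll" },
  ["plants are green", "chlorophyll absorbs the light"],
  fakeEvaluate
);
console.log(isEnough(sessionPlan, 1), isEnough(sessionPlan, 5)); // true false
```

keeping the thresholds in plain tables means changing how strict a level is never requires touching the evaluation prompt.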

to validate these changes, I sat down and acted as 15 different types of users, typing differently, explaining differently, etc., then made changes based on the responses and iterated. one little bug I found was the llm evaluator giving credit to the wrong actor: the goose would teach via analogy and the student would get credit for it. fixed that too.
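the post doesn’t say how the wrong-actor bug was fixed. one possible approach, sketched below under assumed names and an assumed transcript shape, is to tag every turn with its speaker and hand the evaluator only the student’s words as creditable evidence, with the goose’s turns passed along as context that can never earn depth.

```javascript
// Hypothetical transcript shape: each turn records who spoke.
const transcript = [
  { role: "student", text: "a cpu runs instructions one after another" },
  { role: "goose", text: "so is it like a chef working through a recipe?" },
  { role: "student", text: "yes" },
];

// Split the conversation so the evaluator can tell evidence from context.
// Only `creditable` text should ever raise a waypoint's depth.
function buildEvaluatorInput(turns) {
  return {
    creditable: turns.filter((t) => t.role === "student").map((t) => t.text),
    contextOnly: turns.filter((t) => t.role === "goose").map((t) => t.text),
  };
}

console.log(buildEvaluatorInput(transcript).creditable);
```

in this example the chef analogy came from the goose, so a bare "yes" from the student would not be enough to earn credit for it.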

lesson worth keeping: if you build anything where an llm needs to rate or rank by number, don’t trust it. give it something discrete, not subjective, otherwise it will fake it and hallucinate.

professor goose is live if you want to try it!

Similar Projects

Education · ●● Solid

I solved my study problems by talking to a goose

Feynman technique via voice chat with a goose avatar that actually works.

Cozy · Solve My Problem

polaritymaking

221021d ago

Education · ● Mid

I built a "Socratic" AI to stop my daughter from copy-pasting homework

Socratic AI tutor, but ChatGPT already refuses answers—positioning isn't enough.

Bold Bet

qurio_dev

35393mo ago

Education · ● Mid

Depth Check – Your AI tutor to learn about anything

Educational AI wrapper—Duolingo, Khan Academy AI, and paid tutors already own this space.

gcrowne13

203mo ago

Education · ●● Solid

Knowable, the AI tutor that follows your work on paper

Camera watches your actual homework while AI asks questions instead of giving answers.

Niche Gem · Cozy

samuelzxu

208h ago

Developer Tools · ● Mid

Proof of Thought (Pot)

The core idea—make the assistant refuse to write code until you prove you understand the problem—is a sharp behavioral hack that could really change how people use copilots. The repo/article shows a mode table (temperature, allowed tools like write/edit/bash/read/grep) and an explicit permission model, which is a useful blueprint. Right now it reads like a well-argued workflow and config proposal rather than a plug-and-play tool: I'd want an editor/CLI integration or enforcement layer before I call it a must-use.

Big Brain · Niche Gem

ekadet

103mo ago

Developer Tools · ● Mid

We Fixed Code Throughput. Understanding Is Now the Bottleneck

Persistent context layer beats Cursor's session amnesia on large codebases.

Bold Bet

Craze0

421mo ago