I over-engineered a home security camera that uses an LLM and talks

Name: I over-engineered a home security camera that uses an LLM and talks
Availability: InStock
Author: thecal

by thecal·Mar 8, 2026·1 point·0 comments

Visit Project View on HN

AI Analysis

●●SolidWizardryRabbit HoleNiche Gem

Replaces Google Nest's $20/month cloud analysis with local Qwen 35B and a 3D-printed head.

Strengths

•Two-stage compute pipeline (lightweight motion detection + heavy LLM analysis) is pragmatic and power-aware.
•Fully local inference; zero cloud dependency or subscription fees beats proprietary competitors.
•3D-printed enclosure and audio output make it a complete, functional project (not just code).

Weaknesses

•Requires separate GPU-equipped PC (RTX 3090 mentioned) to run Qwen 35B; high barrier to entry for typical users.
•No evaluation framework for accuracy/false positives; unclear how it compares to actual Nest Premium in practice.

Post Description

Roz is an open-source, Python-based pipeline that captures a webcam feed, detects motion, sends the frames to a local Vision LLM to analyze the scene, and then uses text-to-speech to audibly announce any meaningful changes.

The Backstory: I heard an ad for Google’s Home Premium Advanced service, which claims to analyze your Nest doorbell images and describe what it sees. I thought that sounded cool, but I didn't want to pay $20/month for it or send my camera feeds to the cloud. I wanted to see if I could build a localized, subscription-free version myself.I 3D-printed a head-shaped enclosure to house the camera and speaker.

How it works:

1. Stage 1 (Lightweight): Python app runs on a low-power device (like a Raspberry Pi 4) using OpenCV to perform basic frame-differencing. This takes barely any compute. 2. Stage 2 (Heavy): When motion crosses a configurable threshold, the frames are sent to a vision-capable LLM. (I'm using Qwen3.5 35B hosted on a separate PC with an RTX 3090, but any OpenAI-compatible endpoint like vLLM or llama.cpp works). 3. Stage 3 (Audio): The LLM compares the current scene to the previous baseline context. If there is a meaningful change, the LLM generates a text description of what it sees, which is then read out loud locally via Piper TTS.

Hardware & trying it out: Since this relies on physical hardware, the easiest way to see it in action is the demo video in the README (make sure to unmute the audio).

The hardest part so far is the subjectivity of what constitutes a "meaningful change". I'm still tweaking the prompt rules to hit the sweet spot between "announce everything" and "miss important events".

Similar Projects

AI/ML●●Solid

Sentinel Core – Open-source AI video search and detection engine

They stitched together sensible primitives — YOLO for real‑time detection and CLIP for natural‑language search — and packaged it for truly local deployments (SQLite "sovereign memory", MPS/CUDA hints, RTSP ingest). The README and demo show a coherent pipeline and multi‑camera focus, which is rarer than you'd think; what I'd like to see next is clearer install docs, reproducible benchmarks, and more evidence of robustness under production loads.

WizardryNiche Gem

ab-abg

124mo ago

Security●●Solid