Clipto: fully local, natural-language search over terabytes of media

by henry_kang · Jun 2, 2026 · 2 points · 0 comments

AI Analysis

●●● Banger · Big Brain · Solve My Problem

Local Whisper and Qwen3.5 vision models search terabytes of media without cloud uploads.

Strengths
  • Combines ASR, face detection, and vision models entirely on-device for privacy.
  • Graph data structure links people, actions, and scenes for complex natural queries.
  • Premiere Pro integration keeps video editors in their existing professional workflow.
Weaknesses
  • Runs best on Apple Silicon Macs with 24GB+ memory, leaving out Windows users for now.
  • Clipto is an existing popular note-taking app, causing significant brand confusion.
Target Audience

Video editors, photographers, and content creators

Similar To

Google Photos · Digikam · Adobe Lightroom

Post Description

Hey HN,

We recently built Clipto. It’s a tool that lets you search over terabytes of video, audio, and images on your computer, without relying on the cloud.

Motivation: most of us have had this experience. You know a moment exists somewhere in a video or audio file, but finding it takes hours of scrubbing the timeline. You can send all your media to the cloud for processing, but that’s slow and expensive, and it raises privacy concerns. So we decided to build our own on-device media search engine.

How it works (high level):

1. We ingest video, audio, and images; normalize formats via ffmpeg; and run content analysis to downsample frames for deeper analysis.

2. A local ASR pipeline (optimized Whisper) transcribes speech to text and identifies speakers. Faces are detected and, if the person is known, assigned a person ID. A vision model (optimized Qwen3.5) runs on the downsampled frames to extract scenes, actions, objects, on-screen text (OCR), and visual descriptions.

3. A graph data structure ties everything together into a searchable memory.

4. At query time, a lightweight local language model interprets the user’s query and intent. A graph search retrieves all matching clip candidates, and a reranking model orders them.

5. All processing happens on your computer, without touching our servers.
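
To make steps 3 and 4 more concrete, here is a minimal Python sketch of one way such a graph could work. It is an illustration under assumptions, not Clipto's actual implementation: the `Clip` and `MediaGraph` names, the entity kinds ("person", "action", "scene", "speech"), and the overlap-count scoring (a simple stand-in for the reranking model) are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """A time span inside one media file (hypothetical record type)."""
    path: str
    start_s: float
    end_s: float


class MediaGraph:
    """Bipartite graph linking (kind, value) entity nodes to clip nodes."""

    def __init__(self):
        self._clips_by_entity = defaultdict(set)
        self._entities_by_clip = defaultdict(set)

    def link(self, clip, kind, value):
        """Add an edge between a clip and an entity such as ("person", "alice")."""
        entity = (kind, value.lower())
        self._clips_by_entity[entity].add(clip)
        self._entities_by_clip[clip].add(entity)

    def candidates(self, wanted):
        """Return clips linked to at least one wanted entity, best match first."""
        wanted = {(kind, value.lower()) for kind, value in wanted}
        hits = set()
        for entity in wanted:
            hits |= self._clips_by_entity.get(entity, set())

        # Assumed stand-in for the reranking model: count matched entities.
        def score(clip):
            return len(self._entities_by_clip[clip] & wanted)

        return sorted(hits, key=lambda c: (-score(c), c.path, c.start_s))


graph = MediaGraph()
beach = Clip("trip.mov", 12.0, 19.5)
office = Clip("standup.mp4", 0.0, 30.0)
graph.link(beach, "person", "Alice")
graph.link(beach, "action", "surfing")
graph.link(beach, "scene", "beach")
graph.link(office, "person", "Alice")
graph.link(office, "speech", "quarterly roadmap")

# A query like "Alice surfing", already parsed into entities (parsing not shown).
results = graph.candidates([("person", "alice"), ("action", "surfing")])
print([c.path for c in results])  # ['trip.mov', 'standup.mp4']
```

The beach clip ranks first because it matches both requested entities, while the office clip matches only the person. A real system would presumably replace the overlap count with a learned reranker and add edges for timestamps and transcripts.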

Right now, it runs best on Apple Silicon Macs with 24GB+ memory, but we are working on broader support as well as an API/MCP for other agents to call.

We’d love to hear your feedback. Feel free to ask anything!
