Trawl – Scrape any site with natural language fields, not CSS selectors

Author: trawlcli

by trawlcli·Mar 8, 2026·8 points·2 comments


AI Analysis

●●● Banger · Big Brain · Ship It · Solve My Problem

The LLM infers selectors once, then Go extracts the 10k rows: AI for intelligence, Go for throughput.

Strengths

• One-shot LLM call per site structure, then pure Go at full concurrency: token-efficient scaling
• Auto-recovery: a site redesign triggers new LLM inference, so there are no silent failures
• Handles JS-rendered SPAs and iframes with the --js flag, with CSS fallbacks for resilience

Weaknesses

• Caching strategy unclear: how does it detect site redesigns reliably without stale cache risks?
• No mention of rate-limiting, User-Agent rotation, or anti-scraping detection handling

Post Description

Every scraper I've written has the same failure mode: it works for three months, a site redesigns, and my CSS selectors silently return empty strings. The data is still right there on the page — a human can find it instantly — but the scraper is blind.

Trawl fixes this by splitting the problem. You describe what you want:

trawl "https://books.toscrape.com" --fields "title, price, rating, in_stock"

The LLM (Claude) looks at one sample item and derives a full extraction strategy — CSS selectors, attribute mappings, type coercion, fallback selectors. That strategy gets cached. Every subsequent page with the same structure is extracted with pure Go + goquery. No API calls, no token cost, full concurrency.
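A rough sketch of what such a cached strategy could look like as plain Go data. The type and field names (Strategy, FieldRule, Fallbacks, and the JSON keys) are hypothetical and not taken from trawl's source; the point is only that the LLM's answer can be serialized once and reused with no further API calls.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FieldRule describes how to pull one field out of an item.
// All names here are illustrative assumptions, not trawl's real schema.
type FieldRule struct {
	Selector  string   `json:"selector"`            // primary CSS selector
	Fallbacks []string `json:"fallbacks,omitempty"` // tried in order if the primary finds nothing
	Attr      string   `json:"attr,omitempty"`      // attribute to read; empty means element text
	Type      string   `json:"type"`                // target type, e.g. "string" or "float"
}

// Strategy is the plain-data result of a single LLM call.
type Strategy struct {
	ItemSelector string               `json:"item_selector"`
	Fields       map[string]FieldRule `json:"fields"`
	Confidence   float64              `json:"confidence"`
}

// encodeStrategy serializes a strategy so it can be written to a cache.
func encodeStrategy(s Strategy) ([]byte, error) {
	return json.Marshal(s)
}

// decodeStrategy restores a previously cached strategy.
func decodeStrategy(b []byte) (Strategy, error) {
	var s Strategy
	err := json.Unmarshal(b, &s)
	return s, err
}

func main() {
	s := Strategy{
		ItemSelector: "div.product-card",
		Fields: map[string]FieldRule{
			"name":  {Selector: "h2.product-title", Type: "string"},
			"price": {Selector: "span.price", Fallbacks: []string{"[data-price]"}, Type: "float"},
		},
		Confidence: 0.95,
	}
	b, err := encodeStrategy(s)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Because the strategy is just data, applying it to each page is ordinary selector lookups, which is what lets the extraction run concurrently without touching the LLM again.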

The key insight: LLMs are good at understanding HTML structure, but you don't need them to extract 10,000 rows. Use AI for intelligence, Go for throughput.

When a site redesigns, the structural fingerprint changes, the cache misses, and trawl re-derives automatically.
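How trawl computes its structural fingerprint is not spelled out here. One way to get the described behavior, as a minimal sketch: hash the sequence of opening tags and class names while ignoring text content, so pages that differ only in data share a cache key and a redesign produces a new one. The function name and the regex-based tag scan are assumptions for illustration; a real implementation would more likely walk a parsed DOM.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
	"strings"
)

// openTag captures the name and raw attribute text of each opening tag.
// Closing tags and comments do not match because '<' must be followed by a letter.
var openTag = regexp.MustCompile(`(?i)<([a-z][a-z0-9]*)\b([^>]*)>`)

// classAttr captures the value of a double-quoted class attribute.
var classAttr = regexp.MustCompile(`(?i)\bclass\s*=\s*"([^"]*)"`)

// fingerprint hashes tag names and class lists in document order.
// Text content and other attributes are ignored on purpose.
// This is a hypothetical scheme, not trawl's actual algorithm.
func fingerprint(html string) string {
	var sb strings.Builder
	for _, m := range openTag.FindAllStringSubmatch(html, -1) {
		sb.WriteString(strings.ToLower(m[1]))
		if c := classAttr.FindStringSubmatch(m[2]); c != nil {
			sb.WriteByte('.')
			sb.WriteString(strings.Join(strings.Fields(c[1]), "."))
		}
		sb.WriteByte('>')
	}
	sum := sha256.Sum256([]byte(sb.String()))
	return hex.EncodeToString(sum[:])
}

func main() {
	pageA := `<div class="card"><h2 class="title">Book A</h2><span class="price">£10</span></div>`
	pageB := `<div class="card"><h2 class="title">Book B</h2><span class="price">£12</span></div>`
	redesign := `<article class="product"><h3>Book A</h3><p class="cost">£10</p></article>`

	fmt.Println("same structure:", fingerprint(pageA) == fingerprint(pageB))
	fmt.Println("after redesign:", fingerprint(pageA) == fingerprint(redesign))
}
```

With the fingerprint as the cache key, a redesigned page simply fails the lookup, which is the trigger for deriving a fresh strategy.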

You can preview exactly what it figured out:

$ trawl "https://example.com/products" --fields "name, price" --plan

Strategy for https://example.com/products
  Item selector: div.product-card
  Fields:
    name: h2.product-title -> text (string)
    price: span.price -> text -> parse_price (float)
  Confidence: 0.95
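The parse_price step in that plan is a type coercion from scraped text to a float. What trawl's version does internally is not shown, so the following is only a sketch of one plausible coercion, assuming prices use a period as the decimal separator and a comma as the thousands separator. The function name parsePrice is an illustrative stand-in.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePrice keeps digits and the decimal point, drops currency symbols,
// thousands separators and whitespace, then parses the rest as a float64.
// Assumption: "." is the decimal separator; negative prices are not handled.
func parsePrice(raw string) (float64, error) {
	var sb strings.Builder
	for _, r := range raw {
		if (r >= '0' && r <= '9') || r == '.' {
			sb.WriteRune(r)
		}
	}
	if sb.Len() == 0 {
		return 0, fmt.Errorf("no numeric content in %q", raw)
	}
	return strconv.ParseFloat(sb.String(), 64)
}

func main() {
	for _, raw := range []string{"£51.77", "$1,299.00", "Out of stock"} {
		v, err := parsePrice(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(raw, "->", v)
	}
}
```

Returning an error instead of a zero value matters here: an unparseable price should count as a missing field rather than silently become 0.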

Some things that took real engineering effort:

- JS-rendered SPAs: a headless browser with DOM stability detection. It polls until the element count stabilizes and skeleton loaders resolve, scrolls to trigger lazy loading, and auto-clicks "Show more" buttons.
- Multi-section pages: detects candidate data regions heuristically, lets you target a specific section with --query "Market Share", and scopes extraction via container selectors.
- Self-healing: monitors extraction health (% of fields populated) and re-derives the strategy if it drops below 70%.
- Iframes: auto-detects and extracts from iframes when they contain richer data than the outer page.
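The self-healing check can be sketched in a few lines. This assumes extraction health is measured as populated field values divided by expected field values across a batch of rows, and that an empty batch counts as unhealthy; the function names and the row representation are hypothetical, and only the 70% threshold comes from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// healthThreshold is the 70% cutoff mentioned in the post.
const healthThreshold = 0.70

// fillRate returns the fraction of expected field values that are non-empty.
// An empty batch yields 0 so that it is treated as unhealthy (an assumption).
func fillRate(rows []map[string]string, fields []string) float64 {
	if len(rows) == 0 || len(fields) == 0 {
		return 0
	}
	filled := 0
	for _, row := range rows {
		for _, f := range fields {
			if strings.TrimSpace(row[f]) != "" {
				filled++
			}
		}
	}
	return float64(filled) / float64(len(rows)*len(fields))
}

// needsRederive reports whether the cached strategy should be thrown away.
func needsRederive(rows []map[string]string, fields []string) bool {
	return fillRate(rows, fields) < healthThreshold
}

func main() {
	fields := []string{"name", "price"}
	rows := []map[string]string{
		{"name": "Book A", "price": "£10"},
		{"name": "Book B", "price": ""},
	}
	fmt.Printf("fill rate: %.2f, re-derive: %v\n", fillRate(rows, fields), needsRederive(rows, fields))
}
```

This catches the failure mode from the opening paragraph: selectors that still match nothing return empty strings, the fill rate falls, and the drop is detected instead of passing silently.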

Output is JSON, JSONL, CSV, or Parquet. Pipes cleanly:

trawl "https://example.com/products" --fields "name, price" --format jsonl | jq 'select(.price > 50)'
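The same filter can be written as a small Go program for pipelines where jq is not available. This is a sketch that assumes each JSONL line is an object with a string name and a numeric price, matching the fields requested above; the type and function names are illustrative.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// product mirrors the two fields requested with --fields "name, price".
// Assumption: price is emitted as a JSON number.
type product struct {
	Name  string  `json:"name"`
	Price float64 `json:"price"`
}

// priceAbove decodes one JSONL line and reports whether its price exceeds threshold.
func priceAbove(line []byte, threshold float64) (bool, error) {
	var p product
	if err := json.Unmarshal(line, &p); err != nil {
		return false, err
	}
	return p.Price > threshold, nil
}

func main() {
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		ok, err := priceAbove(sc.Bytes(), 50)
		if err != nil {
			fmt.Fprintln(os.Stderr, "skipping malformed line:", err)
			continue
		}
		if ok {
			fmt.Println(sc.Text()) // pass the original line through unchanged
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Saved as, say, filter.go, it would slot in where jq sits: trawl "https://example.com/products" --fields "name, price" --format jsonl | go run filter.go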

Written in Go. MIT licensed.