Pluckr – LLM-powered HTML scraper that caches selectors and auto-heals
LLM-generated selector caching beats manual scraping, but Jina AI and Beautiful Soup handle this cheaper.
LLM infers selectors once, Go extracts 10k rows—smart AI-for-intelligence architecture.
Data engineers and developers scraping dynamic sites without maintaining brittle CSS selectors
Firecrawl · JinaAI · Beautiful Soup with LLM wrapper
Trawl fixes this by splitting the problem. You describe what you want:
trawl "https://books.toscrape.com" --fields "title, price, rating, in_stock"
The LLM (Claude) looks at one sample item and derives a full extraction strategy: CSS selectors, attribute mappings, type coercion, fallback selectors. That strategy gets cached. Every subsequent page with the same structure is extracted with pure Go + goquery. No API calls, no token cost, full concurrency.
The key insight: LLMs are good at understanding HTML structure, but you don't need them to extract 10,000 rows. Use AI for intelligence, Go for throughput.
When a site redesigns, the structural fingerprint changes, the cache misses, and trawl re-derives automatically.
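To make the fingerprint-and-cache idea concrete, here is a minimal sketch in Go. It is not trawl's actual code: the `fingerprint` function, the `strategyCache` type, and the regexp-based tag scan are all assumptions made for illustration (a real implementation would more likely walk a parsed DOM). The point is only that two pages with the same tag/class skeleton hash to the same key, so the expensive derivation runs once.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

var (
	// tagRe captures the name and raw attribute text of every opening tag.
	tagRe = regexp.MustCompile(`<([a-zA-Z][a-zA-Z0-9]*)([^>]*)>`)
	// classRe pulls the value of a double-quoted class attribute.
	classRe = regexp.MustCompile(`\sclass="([^"]*)"`)
)

// fingerprint hashes the sequence of tag names and sorted class names,
// ignoring text content. Hypothetical: the source does not say how trawl
// computes its structural fingerprint.
func fingerprint(html string) string {
	var shape []string
	for _, m := range tagRe.FindAllStringSubmatch(html, -1) {
		sig := strings.ToLower(m[1])
		if c := classRe.FindStringSubmatch(m[2]); c != nil {
			classes := strings.Fields(c[1])
			sort.Strings(classes) // class order should not affect the key
			if len(classes) > 0 {
				sig += "." + strings.Join(classes, ".")
			}
		}
		shape = append(shape, sig)
	}
	sum := sha256.Sum256([]byte(strings.Join(shape, ">")))
	return hex.EncodeToString(sum[:8])
}

// strategyCache maps a fingerprint to a cached strategy (an opaque string here).
type strategyCache map[string]string

// lookup returns the cached strategy, calling derive only on a cache miss.
// derive stands in for the LLM call.
func (c strategyCache) lookup(html string, derive func(string) string) (string, bool) {
	key := fingerprint(html)
	if s, ok := c[key]; ok {
		return s, true
	}
	s := derive(html)
	c[key] = s
	return s, false
}

func main() {
	page1 := `<div class="product-card"><h2 class="product-title">A</h2></div>`
	page2 := `<div class="product-card"><h2 class="product-title">B</h2></div>`
	redesign := `<article class="item"><h3>A</h3></article>`

	cache := strategyCache{}
	derive := func(string) string { return "derived-strategy" }

	for _, p := range []string{page1, page2, redesign} {
		_, hit := cache.lookup(p, derive)
		fmt.Println(fingerprint(p), "cache hit:", hit)
	}
}
```

In this sketch the second page is a hit because only its text differs, while the redesigned markup produces a new key and triggers another derivation.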
You can preview exactly what it figured out:
$ trawl "https://example.com/products" --fields "name, price" --plan
Strategy for https://example.com/products
  Item selector: div.product-card
  Fields:
    name: h2.product-title -> text (string)
    price: span.price -> text -> parse_price (float)
  Confidence: 0.95
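One way to picture the plan above as data is the following Go sketch. The `Strategy` and `Field` types and the `parsePrice` helper are hypothetical; only the selector strings, the `parse_price` name, and the confidence value come from the plan output. The price parser assumes a comma thousands separator and a dot decimal point, which will not hold for every locale.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Field describes how one output column is extracted (hypothetical layout).
type Field struct {
	Selector  string // CSS selector relative to the item
	Source    string // "text" in the plan above
	Transform string // e.g. "parse_price"; empty means no transform
	Type      string // declared output type
}

// Strategy mirrors the --plan output: one item selector plus per-field rules.
type Strategy struct {
	ItemSelector string
	Fields       map[string]Field
	Confidence   float64
}

// priceRe finds the first run of digits with optional thousands commas
// and an optional decimal part.
var priceRe = regexp.MustCompile(`\d[\d,]*(?:\.\d+)?`)

// parsePrice is an assumed implementation of the parse_price transform:
// it strips currency symbols and thousands separators and returns a float.
func parsePrice(raw string) (float64, error) {
	m := priceRe.FindString(raw)
	if m == "" {
		return 0, fmt.Errorf("no numeric price in %q", raw)
	}
	return strconv.ParseFloat(strings.ReplaceAll(m, ",", ""), 64)
}

func main() {
	plan := Strategy{
		ItemSelector: "div.product-card",
		Fields: map[string]Field{
			"name":  {Selector: "h2.product-title", Source: "text", Type: "string"},
			"price": {Selector: "span.price", Source: "text", Transform: "parse_price", Type: "float"},
		},
		Confidence: 0.95,
	}
	fmt.Println("item selector:", plan.ItemSelector)

	price, err := parsePrice("$1,299.00")
	fmt.Println(price, err)
}
```

Keeping the transform as a named step in the strategy, rather than baking it into the selector, is what would let the cached plan be replayed on later pages without consulting the LLM again.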
Some things that took real engineering effort:
- JS-rendered SPAs: headless browser with DOM stability detection; polls until element count stabilizes and skeleton loaders resolve, scrolls to trigger lazy loading, auto-clicks "Show more" buttons
- Multi-section pages: detects candidate data regions heuristically, target a specific section with --query "Market Share", scopes extraction via container selectors
- Self-healing: monitors extraction health (% of fields populated), re-derives the strategy if it drops below 70%
- Iframes: auto-detects and extracts from iframes when they contain richer data than the outer page
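The self-healing check in the list above can be sketched roughly as follows. This is an illustration under assumptions, not trawl's code: `fillRate`, `needsRederive`, and the row representation are made up here, and only the "percentage of fields populated" metric and the 70% threshold come from the description.

```go
package main

import "fmt"

// healthThreshold is the 70% figure from the description above.
const healthThreshold = 0.70

// fillRate returns the fraction of (row, field) cells that are non-empty.
// Rows are modeled as string maps purely for illustration.
func fillRate(rows []map[string]string, fields []string) float64 {
	if len(rows) == 0 || len(fields) == 0 {
		return 0
	}
	filled := 0
	for _, row := range rows {
		for _, f := range fields {
			if row[f] != "" {
				filled++
			}
		}
	}
	return float64(filled) / float64(len(rows)*len(fields))
}

// needsRederive reports whether extraction health fell below the threshold,
// which is the point at which the strategy would be derived again.
func needsRederive(rows []map[string]string, fields []string) bool {
	return fillRate(rows, fields) < healthThreshold
}

func main() {
	fields := []string{"name", "price"}
	rows := []map[string]string{
		{"name": "Widget", "price": "9.99"},
		{"name": "Gadget", "price": ""}, // selector for price stopped matching
	}
	fmt.Printf("fill rate %.2f, re-derive: %v\n", fillRate(rows, fields), needsRederive(rows, fields))
}
```

An empty result set is treated as unhealthy in this sketch, on the assumption that a page yielding no rows at all is a stronger signal of a broken strategy than a page with a few blank fields.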
Output is JSON, JSONL, CSV, or Parquet. Pipes cleanly:
trawl "https://example.com/products" --fields "name, price" --format jsonl | jq 'select(.price > 50)'
Written in Go. MIT licensed.
AI-powered selectors sound good, but Firecrawl, JinaAI, and Bright Data already do this—for less friction.
LLM-flavored scraper, but Firecrawl, Jina, and jsoup already handle dynamic extraction.
Yet another CSS selector generator when browser DevTools does this free.
Removes CSS selector friction with AI visual selection, but Distill.io and Versionista own this category.
Landing page is a Cloudflare checkpoint — can't even see what this does.