GitHub Repository

A command-line tool to extract plain text from Wikipedia dumps with category and section filtering

195 stars · Ruby

WP2TXT – Wikipedia dump text extractor with category/section filtering

Author: yohasebe

by yohasebe · Feb 21, 2026 · 3 points · 0 comments


AI Analysis

●●● Banger · Niche Gem · Dark Horse

Category-aware Wikipedia text extraction with a 20-year maintenance history and parallel processing (about 2 hours for the 24 GB English Wikipedia dump on an Apple M4).

Strengths

• Maintained since 2006: rare longevity, with real-world validation from the corpus linguistics community
• Category recursion plus section filtering solves a specific research use case (e.g., 'extract plot sections from sci-fi articles') that generic dump extractors don't address
• Template expansion (dates, units, coordinates) and content markers ([TABLE], [MATH]) keep information that plain markup stripping would silently discard

Weaknesses

• Niche appeal: mainly valuable to researchers who already work with Wikipedia dumps
• Depends on Ruby and bzip2 tools, which adds setup friction compared with single-binary alternatives

Post Description

WP2TXT is a command-line tool that extracts plain text from Wikipedia dump files. I originally built it in 2006 for corpus linguistics research and have maintained it since. The latest version (2.1) was largely rewritten with features for selective extraction:

- Auto-download dumps by language code (350+ languages)
- Extract specific articles by title without downloading the full dump
- Extract articles from a Wikipedia category with subcategory recursion
- Extract specific sections by name with alias matching (e.g., "Plot" also matches "Synopsis")
- Template expansion (dates, coordinates, unit conversions → readable text)
- Content type markers ([MATH], [TABLE], etc.) instead of silent removal
- Category metadata preserved in output
- JSON/JSONL output
- Parallel processing (English Wikipedia 24 GB dump: ~2 hours on Apple M4)
- Written in Ruby
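
To make a few of these features concrete, here is a minimal Ruby sketch of subcategory recursion, section alias matching, content markers, and JSONL output. It is an illustration under stated assumptions, not WP2TXT's actual code or API: every name in it (`SECTION_ALIASES`, `MARKER_RULES`, `expand_categories`, `split_sections`, `extract_section`, `mark_content`) is hypothetical, the category tree is a small in-memory hash rather than real dump data, and the regular expressions only handle simple, non-nested wikitext.

```ruby
require "json"
require "set"

# Hypothetical alias table: requested section name => headings treated as equivalent.
SECTION_ALIASES = {
  "Plot" => ["Plot", "Synopsis", "Plot summary"]
}.freeze

# Hypothetical marker rules: markup replaced by a placeholder instead of being dropped.
# These patterns are simplified and do not handle nested tables.
MARKER_RULES = {
  "[MATH]"  => /<math[^>]*>.*?<\/math>/m,
  "[TABLE]" => /^\{\|.*?^\|\}/m
}.freeze

# Breadth-first walk over a category tree; `seen` also protects against cycles.
# `subcats` is an in-memory stand-in for real category data.
def expand_categories(root, subcats)
  seen = Set[root]
  queue = [root]
  until queue.empty?
    subcats.fetch(queue.shift, []).each do |child|
      queue << child if seen.add?(child)
    end
  end
  seen
end

# Split wikitext on level-2 headings ("== Title ==") into { heading => body }.
def split_sections(wikitext)
  _lead, *rest = wikitext.split(/^==[ \t]*([^=\n]+?)[ \t]*==[ \t]*$/)
  rest.each_slice(2).to_h { |heading, body| [heading, body.to_s.strip] }
end

# Return the first section whose heading matches the wanted name or one of its aliases.
def extract_section(wikitext, wanted)
  names = SECTION_ALIASES.fetch(wanted, [wanted]).map(&:downcase)
  heading, body = split_sections(wikitext).find { |h, _| names.include?(h.downcase) }
  heading && { "section" => wanted, "matched_heading" => heading, "text" => body }
end

# Replace math and table markup with visible markers.
def mark_content(text)
  MARKER_RULES.reduce(text) { |acc, (marker, pattern)| acc.gsub(pattern, marker) }
end

# Toy data, invented for this example.
subcats = {
  "Science fiction films" => ["Space opera films", "Cyberpunk films"],
  "Space opera films"     => ["Science fiction films"] # cycle on purpose
}
article = <<~WIKI
  Lead paragraph.
  == Synopsis ==
  The energy is <math>E = mc^2</math>.
  {| class="wikitable"
  | cell
  |}
  == Cast ==
  Someone.
WIKI

record = extract_section(article, "Plot")
record["text"] = mark_content(record["text"])
record["categories"] = expand_categories("Science fiction films", subcats).to_a
puts JSON.generate(record) # one JSON object per line is the JSONL convention
```

Asking for "Plot" here returns the "Synopsis" section, because the alias table treats the two headings as equivalent. The math and table markup survive as `[MATH]` and `[TABLE]` placeholders, so a downstream consumer can tell that something was there.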