GitHub Repository

A command-line tool to extract plain text from Wikipedia dumps with category and section filtering

194 stars · Ruby

WP2TXT – Wikipedia dump text extractor with category/section filtering

by yohasebe·Feb 21, 2026·3 points·0 comments

AI Analysis

●●● Banger · Niche Gem · Dark Horse

Category-aware Wikipedia text extraction with a 20-year maintenance history and parallel processing (the English Wikipedia dump in about 2 hours on an Apple M4).

Strengths
  • Maintained since 2006; rare longevity, with real-world validation from the corpus linguistics community
  • Category recursion plus section filtering solves a specific research use case (e.g., 'extract plot sections from sci-fi articles') that generic dump extractors don't address
  • Template expansion (dates, units, coordinates) and content markers ([TABLE], [MATH]) preserve research-grade data fidelity
Weaknesses
  • Niche appeal: valuable mainly to researchers who already work with Wikipedia dumps
  • Dependency on Ruby and bzip2 tools adds setup friction compared with single-binary alternatives
Target Audience

Corpus linguists, NLP researchers, Wikipedia data miners

Similar To

mwclient · pywikibot · MediaWiki API

Post Description

WP2TXT is a command-line tool that extracts plain text from Wikipedia dump files. I originally built it in 2006 for corpus linguistics research and have maintained it since. The latest version (2.1) was largely rewritten with features for selective extraction:

- Auto-download dumps by language code (350+ languages)
- Extract specific articles by title without downloading the full dump
- Extract articles from a Wikipedia category with subcategory recursion
- Extract specific sections by name with alias matching (e.g., "Plot" also matches "Synopsis")
- Template expansion (dates, coordinates, unit conversions → readable text)
- Content type markers ([MATH], [TABLE], etc.) instead of silent removal
- Category metadata preserved in output
- JSON/JSONL output
- Parallel processing (English Wikipedia 24 GB dump: ~2 hours on Apple M4)
- Written in Ruby
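
To make the section-alias idea concrete, here is a minimal Ruby sketch of extracting a section by name with alias matching. It is an illustration only, not WP2TXT's actual implementation: the `SECTION_ALIASES` table and the `split_sections` and `extract_section` helpers are hypothetical names, and the sketch assumes sections are delimited by level-2 wikitext headings (`== Heading ==`).

```ruby
# Hypothetical alias table: canonical section name => alternative headings.
# WP2TXT's real alias list may differ; this is only an assumption.
SECTION_ALIASES = {
  "Plot" => ["Synopsis", "Plot summary"]
}.freeze

# Split wikitext into { heading => body }, using level-2 headings only.
# Text before the first heading is stored under "Lead".
def split_sections(wikitext)
  sections = {}
  current = "Lead"
  sections[current] = +""
  wikitext.each_line do |line|
    if (m = line.match(/\A==\s*([^=].*?)\s*==\s*\z/))
      current = m[1]
      sections[current] = +""
    else
      # Deeper headings (=== ... ===) stay inside the current body.
      sections[current] << line
    end
  end
  sections.transform_values(&:strip)
end

# Return the body of the first section whose heading matches the
# requested name or one of its aliases (case-insensitive), else nil.
def extract_section(wikitext, name)
  wanted = ([name] + SECTION_ALIASES.fetch(name, [])).map(&:downcase)
  match = split_sections(wikitext).find do |heading, _body|
    wanted.include?(heading.downcase)
  end
  match&.last
end

sample = <<~WIKI
  Intro text.
  == Synopsis ==
  A crew discovers a signal.
  == Reception ==
  Critics liked it.
WIKI

puts extract_section(sample, "Plot")
```

In this sketch, asking for "Plot" returns the body of the "Synopsis" section because the alias table maps one to the other; a request with no matching heading returns `nil`.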
