Preparing a sentence dataset from Wikipedia
en.planet.wikimedia.org [Unofficial]
March 13, 2026
I’m excited to announce two new resources for natural language processing researchers and developers:
- wikisentences - A Rust-based tool for extracting sentence datasets from Wikipedia dumps in any language
- ml-wiki-sentences - A dataset of 2.25 million Malayalam sentences extracted from Wikipedia using the above tool, now available on Hugging Face
The Wikisentences Tool
The wikisentences project provides a complete pipeline for creating sentence datasets from Wikipedia content:
Core Technology
- wiki-html-text-extractor (Rust) - Uses tree-sitter-html to parse article HTML and extract clean plain text
- sentencex (Rust) - Handles accurate sentence segmentation across languages. See my recent article about this library.
Four-Stage Pipeline
- Download Enterprise HTML dumps from Wikimedia. There are no recent HTML dumps for Wikipedia, except this one-year-old dump
- Convert JSON dumps to Parquet format (id, name, url, language, html)
- Extract plain text from HTML (id, url, name, text)
- Segment text into sentences (id, url, name, sentence, sentence_index)
Each stage is handled by a separate Python script, with the heavy lifting done by efficient Rust binaries. The pipeline is designed to be memory-efficient, streaming data between stages without writing intermediate files to disk.
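To make the data flow concrete, here is a minimal sketch of stages 2 to 4 as a chain of Python generators, so that each record streams through without intermediate files. It is not the actual wikisentences code. The field names come from the stage descriptions above, but the function names are hypothetical, the Parquet conversion is left out, the standard library's html.parser stands in for wiki-html-text-extractor, and a naive punctuation-based regex stands in for sentencex.

```python
import json
import re
from html.parser import HTMLParser
from typing import Iterable, Iterator

# Tags treated as text boundaries (an assumption made for this sketch).
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "section"}
SKIPPED_TAGS = {"script", "style"}


class TextCollector(HTMLParser):
    """Collects visible text and ignores <script> and <style> content."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def read_articles(lines: Iterable[str]) -> Iterator[dict]:
    """Stand-in for stage 2: one JSON article per line (id, name, url, language, html)."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def extract_text(articles: Iterable[dict]) -> Iterator[dict]:
    """Stage 3: HTML to plain text, yielding (id, url, name, text)."""
    for article in articles:
        collector = TextCollector()
        collector.feed(article["html"])
        collector.close()
        # Collapse runs of whitespace left behind by removed markup.
        text = " ".join("".join(collector.parts).split())
        yield {"id": article["id"], "url": article["url"],
               "name": article["name"], "text": text}


def split_sentences(records: Iterable[dict]) -> Iterator[dict]:
    """Stage 4: yields (id, url, name, sentence, sentence_index).

    Naive stand-in: splits after '.', '!' or '?' followed by whitespace.
    """
    for record in records:
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", record["text"]) if s]
        for index, sentence in enumerate(sentences):
            yield {"id": record["id"], "url": record["url"],
                   "name": record["name"], "sentence": sentence,
                   "sentence_index": index}


if __name__ == "__main__":
    # Hypothetical sample record; a real dump holds one article per JSON line.
    sample = json.dumps({
        "id": 1, "name": "Kerala", "url": "https://example.org/Kerala",
        "language": "en",
        "html": "<p>Kerala is a state in India. It lies on the coast.</p>",
    })
    for row in split_sentences(extract_text(read_articles([sample]))):
        print(row["sentence_index"], row["sentence"])
```

Because every stage is a generator, only one article is held in memory at a time. A real pipeline would swap the regex for a proper segmenter, since splitting on punctuation alone breaks on abbreviations and decimal numbers.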