Preparing a sentence dataset from Wikipedia

en.planet.wikimedia.org [Unofficial] March 13, 2026


I’m excited to announce two new resources for natural language processing researchers and developers:

wikisentences - A Rust-based tool for extracting sentence datasets from Wikipedia dumps in any language
ml-wiki-sentences - A dataset of 2.25 million Malayalam sentences extracted from Wikipedia with the above tool, now available on HuggingFace

The Wikisentences Tool

The wikisentences project provides a complete pipeline for creating sentence datasets from Wikipedia content:

Core Technology

wiki-html-text-extractor (Rust) - Uses tree-sitter-html to parse article HTML and extract clean plain text
sentencex (Rust) - Handles accurate sentence segmentation across languages. See my recent article about this library.

Four-Stage Pipeline

Download enterprise HTML dumps from Wikimedia. There are no recent HTML dumps for Wikipedia, except this one-year-old dump.
Convert JSON dumps to Parquet format (id, name, url, language, html)
Extract plain text from HTML (id, url, name, text)
Segment text into sentences (id, url, name, sentence, sentence_index)

Each stage is handled by a separate Python script, with the heavy lifting done by efficient Rust binaries. The pipeline is designed to be memory-efficient, streaming data between stages without writing intermediate files to disk.
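
As a rough illustration of the streaming design described above, here is a minimal Python sketch of stages 2 to 4 chained as generators, so each record flows through without intermediate files. It is not the code from wikisentences: the function names are hypothetical, the input is assumed to be JSON lines already flattened to the (id, name, url, language, html) columns, the standard-library HTMLParser stands in for wiki-html-text-extractor, and a naive punctuation regex stands in for sentencex. Parquet output is left out to keep the sketch dependency-free.

```python
import json
import re
from html.parser import HTMLParser
from typing import Iterable, Iterator

# Hypothetical set of tags treated as block boundaries by this stand-in.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3"}


class _TextCollector(HTMLParser):
    """Stand-in for wiki-html-text-extractor: keeps text, skips script/style."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")  # keep adjacent blocks from fusing

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def read_articles(lines: Iterable[str]) -> Iterator[dict]:
    """Stage 2 (simplified): parse JSON lines into article records."""
    for line in lines:
        yield json.loads(line)


def extract_text(articles: Iterable[dict]) -> Iterator[dict]:
    """Stage 3: (id, url, name, text) with whitespace normalized."""
    for article in articles:
        parser = _TextCollector()
        parser.feed(article["html"])
        parser.close()
        text = " ".join("".join(parser.parts).split())
        yield {"id": article["id"], "url": article["url"],
               "name": article["name"], "text": text}


# Naive splitter (assumption): break after ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def segment_sentences(docs: Iterable[dict]) -> Iterator[dict]:
    """Stage 4: (id, url, name, sentence, sentence_index)."""
    for doc in docs:
        sentences = [s for s in _SENTENCE_END.split(doc["text"]) if s]
        for index, sentence in enumerate(sentences):
            yield {"id": doc["id"], "url": doc["url"], "name": doc["name"],
                   "sentence": sentence, "sentence_index": index}


def run_pipeline(lines: Iterable[str]) -> Iterator[dict]:
    """Chain the stages lazily; nothing is materialized between them."""
    return segment_sentences(extract_text(read_articles(lines)))
```

Because every stage is a generator, passing an open file handle to run_pipeline processes one article at a time, which is the property that keeps memory use flat on a full dump. A real run would swap the two stand-ins for the Rust binaries.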
