Preparing a sentence dataset from Wikipedia

en.planet.wikimedia.org [Unofficial] March 13, 2026

I’m excited to announce two new resources for natural language processing researchers and developers:

  1. wikisentences - A Rust-based tool for extracting sentence datasets from Wikipedia dumps in any language
  2. ml-wiki-sentences - A dataset of 2.25 million Malayalam sentences extracted from Wikipedia with the above tool, now available on Hugging Face

The Wikisentences Tool

The wikisentences project provides a complete pipeline for creating sentence datasets from Wikipedia content:

Core Technology

  • wiki-html-text-extractor (Rust) - Uses tree-sitter-html to parse article HTML and extract clean plain text
  • sentencex (Rust) - Handles accurate sentence segmentation across languages. See my recent article about this library
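
To give a feel for what the extraction step does, here is a minimal sketch in Python using only the standard library's html.parser. It is not the wiki-html-text-extractor implementation (which is written in Rust on top of tree-sitter-html). The set of skipped tags and the names PlainTextExtractor and extract_text are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as non-prose. This set is an assumption
# for the sketch, not the rule set used by wiki-html-text-extractor.
SKIPPED_TAGS = {"script", "style", "table", "sup"}


class PlainTextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.chunks.append(data)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

For example, extract_text('<p>Hello <b>world</b>.<sup>[1]</sup></p>') drops the footnote marker and keeps only the sentence text.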

Four-Stage Pipeline

  1. Download Enterprise HTML dumps from Wikimedia. There are no recent HTML dumps for Wikipedia, except this one-year-old dump
  2. Convert JSON dumps to Parquet format (id, name, url, language, html)
  3. Extract plain text from HTML (id, url, name, text)
  4. Segment text into sentences (id, url, name, sentence, sentence_index)
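
As an illustration of how the row shape changes in the last stage, the sketch below turns one stage-3 row (id, url, name, text) into one row per sentence (id, url, name, sentence, sentence_index). The regex splitter is a naive stand-in for sentencex and will mis-split abbreviations; the function name segment_row and the zero-based sentence_index are assumptions, not details taken from the wikisentences code.

```python
import re
from typing import Dict, Iterator

# Naive stand-in for sentencex: break after ., ! or ? followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def segment_row(row: Dict[str, str]) -> Iterator[Dict[str, object]]:
    """Yield one output row per sentence found in row['text']."""
    sentences = [s for s in _BOUNDARY.split(row["text"].strip()) if s]
    for index, sentence in enumerate(sentences):
        yield {
            "id": row["id"],
            "url": row["url"],
            "name": row["name"],
            "sentence": sentence,
            "sentence_index": index,  # assumed to start at 0
        }
```

Keeping id, url and name on every sentence row makes each sentence traceable to its article, and sentence_index preserves the original order within that article.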

Each stage is handled by a separate Python script, with the heavy lifting done by efficient Rust binaries. The pipeline is designed to be memory-efficient, streaming data between stages without writing intermediate files to disk.
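
The streaming idea can be sketched with chained Python generators: each stage pulls one record at a time from the previous one, so no stage holds the whole dump in memory. This is only an illustration of the pattern, not the actual scripts. It assumes newline-delimited JSON input with keys named after the columns listed above (the real dump's field names may differ), and the tag-stripping regex is a crude placeholder for the Rust extractor.

```python
import json
import re
from typing import Dict, Iterable, Iterator

# Crude placeholder for HTML-to-text extraction, used only in this sketch.
_TAG = re.compile(r"<[^>]+>")


def parse_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Decode one JSON record per non-empty input line, lazily."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def to_text(rows: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Replace the html column with a text column, one row at a time."""
    for row in rows:
        yield {
            "id": row["id"],
            "url": row["url"],
            "name": row["name"],
            "text": _TAG.sub("", row["html"]),
        }


def run_pipeline(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Chain the stages; nothing is read until the caller iterates."""
    return to_text(parse_records(lines))
```

Because every stage is a generator, asking for the first output row reads only the first input line, which is what keeps memory use flat regardless of dump size.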
