{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreif3ecewjk7qlasvm76vxcfagqfu7uxdg6q4ercqjsilbl5jbnu54u",
"uri": "at://did:plc:jo3wjj2gx46alocis4wubmwr/app.bsky.feed.post/3mgzrw3yzodh2"
},
"path": "/blog/2026/03/14/wikisentences/",
"publishedAt": "2026-03-13T23:30:00.000Z",
"site": "https://thottingal.in",
"tags": [
"wikisentences",
"ml-wiki-sentences",
"tree-sitter-html",
"recent article about this library",
"enterprise HTML dumps"
],
"textContent": "I’m excited to announce two new resources for natural language processing researchers and developers:\n\n 1. **wikisentences** - A Rust-based tool for extracting sentence datasets from Wikipedia dumps in any language\n 2. **ml-wiki-sentences** - A dataset of 2.25 million Malayalam sentences extracted from Wikipedia, now available on HuggingFace, prepared using the above tool.\n\n\n\n## The Wikisentences Tool\n\nThe wikisentences project provides a complete pipeline for creating sentence datasets from Wikipedia content:\n\n### Core Technology\n\n * **wiki-html-text-extractor** (Rust) - Uses tree-sitter-html to parse article HTML and extract clean plain text\n * **sentencex** (Rust) - Handles accurate sentence segmentation across languages. See my recent article about this library\n\n\n\n### Four-Stage Pipeline\n\n 1. **Download** enterprise HTML dumps from WikimediaThere is no recent html dumps for wikipedia, except this one year old dump\n 2. **Convert** JSON dumps to Parquet format (id, name, url, language, html)\n 3. **Extract** plain text from HTML (id, url, name, text)\n 4. **Segment** text into sentences (id, url, name, sentence, sentence_index)\n\n\n\nEach stage is handled by a separate Python script, with the heavy lifting done by efficient Rust binaries. The pipeline is designed to be memory-efficient, streaming data between stages without writing intermediate files to disk.",
"title": "Preparing sentence dataset from a wikipedia"
}