Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreif3ecewjk7qlasvm76vxcfagqfu7uxdg6q4ercqjsilbl5jbnu54u",
    "uri": "at://did:plc:jo3wjj2gx46alocis4wubmwr/app.bsky.feed.post/3mgzrw3yzodh2"
  },
  "path": "/blog/2026/03/14/wikisentences/",
  "publishedAt": "2026-03-13T23:30:00.000Z",
  "site": "https://thottingal.in",
  "tags": [
    "wikisentences",
    "ml-wiki-sentences",
    "tree-sitter-html",
    "recent article about this library",
    "enterprise HTML dumps"
  ],
  "textContent": "I’m excited to announce two new resources for natural language processing researchers and developers:\n\n  1. **wikisentences** - A Rust-based tool for extracting sentence datasets from Wikipedia dumps in any language\n  2. **ml-wiki-sentences** - A dataset of 2.25 million Malayalam sentences extracted from Wikipedia, now available on HuggingFace, prepared using the above tool.\n\n\n\n## The Wikisentences Tool\n\nThe wikisentences project provides a complete pipeline for creating sentence datasets from Wikipedia content:\n\n### Core Technology\n\n  * **wiki-html-text-extractor** (Rust) - Uses tree-sitter-html to parse article HTML and extract clean plain text\n  * **sentencex** (Rust) - Handles accurate sentence segmentation across languages. See my recent article about this library\n\n\n\n### Four-Stage Pipeline\n\n  1. **Download** enterprise HTML dumps from WikimediaThere is no recent html dumps for wikipedia, except this one year old dump\n  2. **Convert** JSON dumps to Parquet format (id, name, url, language, html)\n  3. **Extract** plain text from HTML (id, url, name, text)\n  4. **Segment** text into sentences (id, url, name, sentence, sentence_index)\n\n\n\nEach stage is handled by a separate Python script, with the heavy lifting done by efficient Rust binaries. The pipeline is designed to be memory-efficient, streaming data between stages without writing intermediate files to disk.",
  "title": "Preparing sentence dataset from a wikipedia"
}
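
The extract and segment stages described in the record's text can be illustrated with a small streaming sketch. This is not the wikisentences code: the standard-library `HTMLParser` stands in for wiki-html-text-extractor, a naive punctuation regex stands in for sentencex, and the function names `extract_text` and `segment` are hypothetical. Only the column names (id, name, url, language, html, text, sentence, sentence_index) come from the post.

```python
import re
from html.parser import HTMLParser
from typing import Iterable, Iterator

# Tags after which a space is added so adjacent blocks do not fuse together.
_BLOCK_TAGS = {"p", "div", "li", "br", "h1", "h2", "h3", "h4", "h5", "h6", "tr"}


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag in _BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(rows: Iterable[dict]) -> Iterator[dict]:
    """Stage 3 stand-in: (id, name, url, language, html) -> (id, url, name, text)."""
    for row in rows:
        collector = _TextCollector()
        collector.feed(row["html"])
        collector.close()
        # Collapse runs of whitespace into single spaces.
        text = " ".join("".join(collector.parts).split())
        yield {"id": row["id"], "url": row["url"], "name": row["name"], "text": text}


# Naive splitter (assumption): break after ., ! or ? followed by whitespace.
# Real segmentation has to handle abbreviations and language-specific rules.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def segment(rows: Iterable[dict]) -> Iterator[dict]:
    """Stage 4 stand-in: (id, url, name, text) -> one row per sentence."""
    for row in rows:
        sentences = [s for s in _SENTENCE_END.split(row["text"]) if s]
        for index, sentence in enumerate(sentences):
            yield {
                "id": row["id"],
                "url": row["url"],
                "name": row["name"],
                "sentence": sentence,
                "sentence_index": index,
            }
```

Because both functions are generators, `segment(extract_text(rows))` processes one article at a time and never holds the whole dump in memory, which mirrors the streaming design the post describes.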