Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiasbqw6cw4gedbcyo76zcufe446xuziernr34xxgeizozg7zb3t4m",
    "uri": "at://did:plc:jo3wjj2gx46alocis4wubmwr/app.bsky.feed.post/3mfwszbf7hcv2"
  },
  "path": "/blog/2026/02/28/malayalam-markov-chain/",
  "publishedAt": "2026-02-27T23:30:00.000Z",
  "site": "https://thottingal.in",
  "tags": [
    "The Broken Token"
  ],
  "textContent": "Ever wondered how a computer learns to generate text that actually looks like Malayalam? Not just random characters, but something with actual structure? I’m not talking about Large Language Models here. I’m talking about Small Language Models that are efficient and explainable. something you can build and run on your own laptop.\n\nIn my previous “The Broken Token” article, I presented a Malayalam unigram tokenizer and analysed its strengths and weaknesses. I did fertility rate evaluation and then analysed the tokenization in the context of Malayalam language characteristics. A common evaluation method for tokenizers is using them in downstream tasks—so I decided to build a text generator. That’s where things got interesting.",
  "title": "From Tokens to Text: A Trigram Markov Model for Malayalam"
}