From Tokens to Text: A Trigram Markov Model for Malayalam

en.planet.wikimedia.org [Unofficial] February 27, 2026

Ever wondered how a computer learns to generate text that actually looks like Malayalam? Not just random characters, but something with real structure? I’m not talking about Large Language Models here. I’m talking about Small Language Models that are efficient and explainable, something you can build and run on your own laptop. In my previous article, “The Broken Token”, I presented a Malayalam unigram tokenizer and analysed its strengths and weaknesses. I evaluated its fertility rate (the average number of tokens produced per word) and then examined the tokenization in the context of Malayalam language characteristics. A common way to evaluate a tokenizer is to use it in a downstream task, so I decided to build a text generator. That’s where things got interesting.
