
Multilingual AI is not just an access problem

Julie Belião May 25, 2026

In 2024, after moderating a panel at TAUS on multilingual AI, I wrote about why “multilingual” too often still means English-first: English-first data, English-first evaluation, English-first assumptions, then a localization layer added on top. That piece, “Towards truly multilingual AI: breaking English dominance,” was mostly about access, quality, and participation. Who gets served well. Who gets excluded. Who gets to build AI products for their own markets, in their own language, without going through English as the default bridge.

Last week, in “Language is not neutral,” I came back to the same problem from a more personal angle: what happens when you step inside a radically different language. Quechua requiring you to mark the source of a claim. Tzeltal anchoring space to terrain rather than to the observer. Aymara placing the past in front, because the past is visible.

This piece is about the layer underneath both arguments.

If language shapes how a system reasons, not just what it says, then building primarily in English is not a neutral technical choice. It is a conceptual one. In practice, English-dominant training is rarely only about language. It also carries cultural, institutional, and epistemic assumptions from the environments that produced most of the data. In 2026, with agents moving into operational, legal, administrative, and public-service contexts, that choice has costs that most teams are not measuring.

The reasoning layer starts before alignment

When language bias comes up in technical conversations, it usually lands on data distribution. English is overrepresented. Some languages are tokenized badly. Benchmarks are contaminated or translated too literally. All true. All important.

But the deeper problem starts earlier.

Pre-training is where the base model learns the statistical and conceptual structure of text. Not only vocabulary and syntax, but also assumptions about how knowledge is sourced, how time is structured, how space is organized, and how causality flows.

Post-training, instruction tuning, and RLHF can change behavior. They can make a model more helpful, more careful, more aligned with expected formats. They can surface capabilities and suppress bad habits. What they are much weaker at doing is changing the representational structure the model already learned during pre-training. The scaffolding is already there. You are working inside it, not replacing it.

A 2025 paper by researchers at MBZUAI makes this problem concrete. Using BICAUSE, a bilingual Chinese-English dataset for causal reasoning, the authors found that the same model, with the same weights, showed different internal attention patterns depending on the operating language. In Chinese, models tended to over-apply a language-specific prior that the initial phrase signals the cause. In English, attention was more balanced under structural variation. Their conclusion was not simply that models output different things in different languages. It was that they internalize language-specific reasoning biases.

There is a nuance here. The same paper also found that, when reasoning succeeds, hidden representations can converge toward a more language-agnostic abstraction space. That matters. It means models are not trapped at the surface of language. But the failure mode still matters. When the input deviates from the structure the model expects in that language, accuracy degrades. That is exactly the condition you are in when you deploy into a linguistic, cultural, or operational context your model was not deeply built for.

Some things English never forces you to say

Every language forces its speakers to track certain things automatically. English forces you to track tense. You usually have to locate an event in time: past, present, or future.

What English does not force you to track is the source of your knowledge.

“The service is healthy.”

That sentence is grammatically complete whether the agent observed the health check directly, received the status from a monitoring system, or inferred it from reduced error rates. The grammar does not care.

In an infrastructure context, the system absolutely should care. Observed state, reported state, and inferred state should not collapse into one representation. When something breaks, that distinction is not linguistic decoration. It is operational truth.
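One way to picture that separation is a status record that cannot be written without its evidence type. The sketch below is illustrative only: the names Evidence, Claim, render, and weakest are hypothetical, and the ordering of evidence strength (observed above reported above inferred) is an assumption that a real system might reasonably set differently.

```python
from dataclasses import dataclass
from enum import IntEnum


class Evidence(IntEnum):
    """How a claim is known. The ordering is an assumption: higher = stronger."""
    INFERRED = 1   # derived from other signals, e.g. falling error rates
    REPORTED = 2   # relayed by another system, e.g. a monitoring service
    OBSERVED = 3   # checked directly by the agent, e.g. its own health probe


@dataclass(frozen=True)
class Claim:
    """A statement that always carries how it is known and where it came from."""
    statement: str
    evidence: Evidence
    source: str


def render(claim: Claim) -> str:
    """Format a claim so the evidence type is visible, never implied."""
    return f"[{claim.evidence.name.lower()} via {claim.source}] {claim.statement}"


def weakest(claims: list[Claim]) -> Evidence:
    """A summary built on several claims is only as strong as its weakest one."""
    return min(c.evidence for c in claims)


# The same English sentence, held as three distinct records.
claims = [
    Claim("The service is healthy.", Evidence.OBSERVED, "health probe"),
    Claim("The service is healthy.", Evidence.REPORTED, "monitoring system"),
    Claim("The service is healthy.", Evidence.INFERRED, "error-rate trend"),
]

for claim in claims:
    print(render(claim))
print("weakest evidence:", weakest(claims).name)
```

The point of the sketch is the type, not the formatting: three records with identical wording stay distinct, so a downstream step has to decide what to do with an inferred status rather than silently treating it as observed.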

Many languages make this distinction harder to ignore. Turkish, Quechua, Bulgarian, and others require speakers to mark evidentiality: whether something was witnessed, inferred, reported, or otherwise known. The distinction is not an optional adverb. It is part of the grammar.

A model trained mostly on English does not naturally inherit that pressure. You can teach it to say “according to the logs” or “it appears that.” But that does not mean the model is reliably tracking, through the reasoning process, whether a claim was observed, reported, or inferred. The surface can improve faster than the underlying representation.

A related gap shows up with time and process state. English organizes time heavily around when something happened: past, present, future. Other languages give more grammatical weight to aspect: whether an action is complete, ongoing, bounded, repeated, or still unfolding.

For an agent, this is not academic. A workflow can be completed, still running, partially complete, stalled, retried, or waiting on a dependency. If your evaluation suite is organized around the same tense-dominant assumptions as your model, you may not notice where it is weak. The failure will not look like bad translation. It will look like poor state tracking.
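A small sketch of what it could mean to evaluate state tracking directly. Everything here is hypothetical: the Aspect labels simply mirror the states listed above, the log lines are invented toy inputs, and naive_predict stands in for a system that only distinguishes "done" from "not done." Scoring per state rather than in aggregate is the design choice being illustrated.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Aspect(Enum):
    """Process states an agent may need to tell apart (labels are illustrative)."""
    COMPLETED = "completed"
    RUNNING = "still running"
    PARTIAL = "partially complete"
    STALLED = "stalled"
    RETRIED = "retried"
    WAITING = "waiting on a dependency"


@dataclass(frozen=True)
class StateCase:
    """One toy eval case: a log excerpt and the state the agent should report."""
    log: str
    expected: Aspect


def score_by_aspect(
    cases: list[StateCase], predict: Callable[[str], Aspect]
) -> dict[Aspect, float]:
    """Accuracy per expected state, so weak states are not averaged away."""
    totals: dict[Aspect, int] = {}
    hits: dict[Aspect, int] = {}
    for case in cases:
        totals[case.expected] = totals.get(case.expected, 0) + 1
        if predict(case.log) == case.expected:
            hits[case.expected] = hits.get(case.expected, 0) + 1
    return {aspect: hits.get(aspect, 0) / n for aspect, n in totals.items()}


def naive_predict(log: str) -> Aspect:
    """Stand-in for a system that only asks: did something finish or not?"""
    return Aspect.COMPLETED if "finished" in log else Aspect.RUNNING


cases = [
    StateCase("step 3/3 finished", Aspect.COMPLETED),
    StateCase("step 2/3 in progress", Aspect.RUNNING),
    StateCase("step 2/3 finished, step 3/3 not started", Aspect.PARTIAL),
    StateCase("no heartbeat for 30 min", Aspect.STALLED),
]

for aspect, accuracy in score_by_aspect(cases, naive_predict).items():
    print(f"{aspect.value}: {accuracy:.2f}")
```

On these toy cases the binary predictor is right on the completed and running cases and wrong on the partial and stalled ones. An aggregate score would report half right; the per-state breakdown shows which distinctions are missing.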

These are not edge cases for non-English users. They are structural features of how a model represents uncertainty, attribution, and process in any language it operates in.

Fine-tuning helps. It does not erase the substrate.

Post-training can do real work here. Instruction tuning can teach a model to hedge more consistently, cite sources, flag uncertainty, follow local communication norms, and produce more careful outputs. In many product contexts, that is useful. Sometimes it is enough.

But we should not confuse better behavior with deeper representation. A model can learn to produce the right phrase without reliably tracking the distinction underneath it. It can say “apparently” without having a robust internal representation of evidentiality. It can say “the task may still be running” without consistently reasoning about boundedness, partial completion, or process state.

This is one of the uncomfortable patterns we see in alignment more broadly: behavioral compliance without representational grounding.

Cross-lingual instruction tuning helps more when the data is authored natively in the target language rather than translated from English. Native-language prompts, locally written reasoning traces, and target-culture annotations can activate and reinforce multilingual capabilities that were already present in the base model. But the ceiling is still shaped by what was learned during pre-training. Post-training can move you closer to that ceiling. It does not fully rebuild the house.

Your multilingual evals may be testing translation, not reasoning

This is where the problem becomes practical. If your multilingual evaluation is built by translating English benchmarks, you are probably not testing your product in that language. You are testing how well your English assumptions survive translation.

That is not the same thing.

A 2025 Brown University and UC Berkeley paper on multilingual functional evaluation argues that static benchmarks such as Belebele, M-MMLU, and M-GSM do not adequately capture practical multilingual performance. The authors introduce functional benchmarks for instruction-following and find large drops compared with static evaluations, especially in languages such as Arabic and Yoruba.

That gap matters because production failures are rarely clean multiple-choice failures. They are failures of instruction following, ambiguity resolution, authority, politeness, refusal, escalation, evidence, and trust. They happen when the model has to operate inside a context, not just answer a question about it.

Translated benchmarks make this worse. They carry over assumptions from the source language: what counts as obvious, what needs to be stated, what order information should appear in, what kind of explanation is considered sufficient.

The fix is not glamorous: functional evaluation tasks, test cases authored natively in the target language, local annotators at critical points, native-language system prompts, and reasoning chains written in the language and culture where the product will operate. Slower, more expensive, less scalable on paper, and much closer to what actually breaks.
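A first step that costs little is simply recording where each test case came from. The sketch below assumes a hypothetical EvalCase record with an origin field ("native" or "translated") and a task_type field ("functional" or "static"); the field names, values, and sample cases are illustrative, not taken from any existing benchmark.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """Provenance metadata for one test case (all field names are hypothetical)."""
    case_id: str
    language: str   # language tag, e.g. "ar" or "yo"
    origin: str     # "native" (authored in the language) or "translated"
    task_type: str  # "functional" or "static"


def native_share(cases: list[EvalCase]) -> dict[str, float]:
    """Fraction of cases per language that were authored natively."""
    total = Counter(c.language for c in cases)
    native = Counter(c.language for c in cases if c.origin == "native")
    # Counter returns 0 for languages with no native cases.
    return {lang: native[lang] / n for lang, n in total.items()}


suite = [
    EvalCase("yo-001", "yo", "native", "functional"),
    EvalCase("yo-002", "yo", "translated", "static"),
    EvalCase("ar-001", "ar", "translated", "static"),
]

for lang, share in native_share(suite).items():
    print(f"{lang}: {share:.0%} natively authored")
```

A report like this does not measure quality. It only makes visible how much of a "multilingual" suite is English in translation, which is the precondition for deciding where native authoring and local annotators are worth the cost.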

Sovereignty has to go deeper than infrastructure

This is where the sovereign AI conversation gets uncomfortable.

A country can own the infrastructure, fund the compute, deploy the model locally, and still inherit conceptual assumptions from an English-dominant pre-training substrate.

Europe is investing seriously in sovereign AI infrastructure. EuroHPC lists 19 AI Factories and 13 AI Factory Antennas across Europe. The EU’s AI Gigafactory plans target facilities with around 100,000 advanced AI chips each. That matters. Compute sovereignty is real. But compute sovereignty is not the same as conceptual sovereignty. Most of what is being built on top of that infrastructure starts from English-dominant pre-training pipelines and fine-tunes for local languages afterward. The conceptual grammar is inherited, not chosen.

India is asking a more interesting question. Sarvam released 30B and 105B open-source reasoning models built with Indian-language performance as a central objective rather than an afterthought. The models are not magic. No model is. But the direction matters: start from the linguistic reality you are trying to serve, not from the assumption that English is the universal substrate and everything else is a localization layer.

This is the question many sovereign AI initiatives are still not asking clearly enough. A model fine-tuned in French on top of an English-dominant foundation may produce good French. It may even be useful in many French public-sector contexts. Whether it reasons with French conceptual assumptions is a different question.

And if the answer is no, or only partially, then sovereignty is also partial.

You may own the deployment, control the data, and host the weights, but the model may still have learned to think with someone else’s grammar.

Access is not the same as depth

The multilingual AI conversation has mostly been about access: more languages in the training data, more people able to use the system, better coverage, lower barriers. That argument stands, and it is unfinished. But access is the surface layer.

Underneath it sits a harder question: what can the system represent when it reasons? What distinctions does it track automatically? What does it flatten because the dominant language of pre-training never forced the distinction to exist?

The assumptions embedded at pre-training do not always show up as obviously wrong outputs. They show up as systematic gaps when the model enters situations the grammar it was built on never required it to handle.

Sovereign AI initiatives are multiplying. Evaluation practices are improving. Post-training is getting better. All of that matters. But if the base model inherits its conceptual grammar from English, then sovereignty, multilingual coverage, and alignment improvements are all working inside a substrate someone else chose.

The deeper question is not only who owns the infrastructure, the data, or the deployment. It is whether the system learned to think with the world you are asking it to serve.