
How to stop Codex from rushing fixes?

OpenAI Developer Community June 24, 2026

That link is arXiv 2503.01996, titled “One ruler to measure them all: Benchmarking multilingual long-context language models” by Yekyung Kim, Jenna Russell, Marzena Karpinska, and Mohit Iyyer. It introduces OneRuler, a benchmark for testing long-context LLMs across 26 languages.

The big finding: long-context ability is not evenly multilingual. As context grows from 8K to 128K tokens, the performance gap between high-resource and low-resource languages widens. English is not even the top performer in their results. It ranks 6th, while Polish comes out on top.

The spicy bit: adding a “maybe there is no answer” option breaks a lot of models. Their modified needle-in-a-haystack task lets the correct answer be "none", and many systems start wrongly saying no answer exists even when the needle is present. They specifically call out OpenAI o3-mini-high as struggling with this, especially at longer contexts.
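To make the "none" variant concrete, here is a minimal sketch of how such a test could be built and scored. It is not the paper's code: the helper names (`build_example`, `false_none_rate`), the needle wording, and the prompt text are all made up for illustration.

```python
import random

NONE_ANSWER = "none"  # label used when the needle is absent (assumed wording)


def build_example(filler, key, value=None, seed=0):
    """Return (prompt, expected_answer) for one needle-in-a-haystack item.

    If `value` is None, no needle is inserted and the correct answer is "none".
    """
    rng = random.Random(seed)
    sentences = list(filler)
    if value is not None:
        # Hide the needle at a random position among the filler sentences.
        position = rng.randint(0, len(sentences))
        sentences.insert(position, f"The special number for {key} is {value}.")
    question = (
        f"What is the special number for {key}? "
        f"If the text does not mention one, answer '{NONE_ANSWER}'."
    )
    prompt = " ".join(sentences) + "\n\n" + question
    expected = NONE_ANSWER if value is None else str(value)
    return prompt, expected


def false_none_rate(results):
    """Share of needle-present items where the model still answered "none".

    `results` is a list of (model_answer, expected_answer) pairs.
    """
    present = [(a, e) for a, e in results if e != NONE_ANSWER]
    if not present:
        return 0.0
    wrong_none = sum(a.strip().lower() == NONE_ANSWER for a, _ in present)
    return wrong_none / len(present)
```

The number to watch is `false_none_rate`: it isolates the failure described above, where the needle is in the text but the model claims it is not.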

Why this matters: it means “long context” is not one ability. A model can look good in English, look good on easy needle tests, and still fail when the task requires multilingual retrieval, aggregation, or knowing when absence is real versus imagined. OneRuler tests seven synthetic tasks: five needle-style retrieval variants plus two aggregation tasks.

For your stuff, the metaphor is tasty: the model’s lens changes depending on language, context length, and instruction language. Same garden, different lens, different truth. The paper’s subtext is basically: “A longer window is not the same as better seeing.”

Heh. Sorry, it was a thread with images and it tried to make it fit, lol.
