
Introducing MediaWiki Code2Code Search: Semantic search to find code by under-the-surface similarity

en.planet.wikimedia.org [Unofficial] April 14, 2026

The Telugu user interface of MediaWiki Code2Code Search

Have you ever tried to find a specific function in a MediaWiki extension but only vaguely remembered what it does, not its name? If so, a new tool might help. It’s called MediaWiki Code2Code Search, and it’s a bit different from what we’ve had before. It uses semantic search – think “search by meaning, not just by exact wording” – to help you find code snippets across our repositories by representing them as mathematical vectors.

It’s called “Code2Code” because you search with a piece of code to find other code: no need to write a natural language query. You paste a piece of code that captures some behaviour, and the tool attempts to find similar code across MediaWiki extensions and the core, even if the programming language, coding style, variable names or exact syntax differ.

This post is for MediaWiki extension developers, gadget authors, and anyone who navigates our shared codebase. Possible use cases include code maintenance, tracing vulnerabilities and detecting plagiarism.

That’s the idea. Now let’s look under the hood.

Demo: searching for code similar to a Python recursive greatest common divisor implementation.

So what does it actually do?

Offline indexing (stages 1′–4′) meets online searching (stages 1–5).

The existing MediaWiki Codesearch (which won the 2019 Coolest Tool Award and is still relied on by many – with good reason!) is fast and reliable. It uses classic pattern-matching to find exact text. It’s like a very precise, very fast pair of eyes.

By contrast, the new MediaWiki Code2Code Search tries to understand what you mean, even if you don’t type the exact words or symbols. It turns code snippets into “embeddings” – mathematical vectors in a high‑dimensional space. Similar code ends up close together in that vector space. The background animation, with swirling coloured particles, is an artistic representation of this mathematical transformation: each dot represents a code embedding, simplified for the eye. Close vectors are similar in meaning, yet distinct in origin.

Code2Code Search leverages a Jina AI bi-encoder model. Wikimedia Deutschland (WMDE) had a positive experience with Jina AI during the Wikidata Embedding Project, so I built on that. Since late 2025, WMF has also been developing hybrid search models (and, as of March 2026, is piloting them live on several Wikipedia editions), combining traditional keyword search with semantic-embedding-based retrieval.

Every component of MediaWiki Code2Code Search is open-source. The tool uses a single-stage retrieval strategy: the bi-encoder model from Jina AI (with 500 million parameters) computes offline vector representations for code snippets; then, the same model converts the query into a vector online, and a vector index finds the most relevant code snippets from the entire indexed codebase. The model runs within Toolforge’s default constraints and returns results ranked by semantic similarity.
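The single-stage flow described above (embed every snippet offline, embed the query online with the same model, rank by vector similarity) can be sketched in a few lines. This is only a toy illustration under loud assumptions: `toy_embed` is a hypothetical stand-in that hashes tokens into a vector and captures lexical overlap only, whereas the real tool uses a Jina AI bi-encoder; the brute-force loop stands in for the vector index; `DIM`, `build_index` and `search` are names invented for this sketch.

```python
import hashlib
import math
import re

DIM = 256  # toy vector size (assumption; not the real model's dimensionality)


def toy_embed(code):
    """Hypothetical stand-in for the bi-encoder: hash tokens into a unit vector."""
    vec = [0.0] * DIM
    # Split into identifiers and single non-space symbols.
    for token in re.findall(r"[A-Za-z_]\w*|\S", code):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def build_index(snippets):
    """Offline stage: embed every snippet once and keep the vectors."""
    return [(snippet, toy_embed(snippet)) for snippet in snippets]


def search(index, query, top_k=3):
    """Online stage: embed the query, then rank snippets by cosine similarity.

    Vectors are unit-length, so the dot product equals the cosine similarity.
    """
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, v)), snippet) for snippet, v in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    snippets = [
        "def gcd(a, b):\n    return a if b == 0 else gcd(b, a % b)",
        "function add($x, $y) { return $x + $y; }",
        "const greet = (name) => console.log('hi ' + name);",
    ]
    index = build_index(snippets)
    for score, snippet in search(index, "def gcd(x, y):\n    return x if y == 0 else gcd(y, x % y)"):
        print(round(score, 3), snippet.splitlines()[0])
```

The design point is that the expensive part (embedding the whole codebase) happens once, offline, so a query costs one model call plus a nearest-neighbour lookup. A learned bi-encoder replaces `toy_embed` precisely so that renamed variables or a different programming language still land nearby in the vector space, which this hashing toy cannot do.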

Archiving the MediaWiki codebase to Software Heritage

Like dandelion seeds, open-source code can scatter and be lost. Archiving it gives it roots.

As Wikimedians, we know knowledge requires careful stewardship. Beyond just finding code, we must ensure the codebase remains traceable.

To this end, I have archived over 2,400 MediaWiki code repositories in Software Heritage (SWH). As a UNESCO-backed digital public good, SWH mirrors Wikimedia’s commitment to free, equitable knowledge. By providing a unified data model across forges (GitLab, GitHub, Gerrit, Bitbucket), SWH offers SWHIDs — intrinsic permalinks that ensure a specific line of code remains accessible forever, even if the original repository is moved or deleted.
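To make “intrinsic” concrete: a SWHID for a file's content is derived from the bytes themselves, using the same hash that Git computes for a blob, so anyone holding the file can recompute the identifier. The sketch below assumes the core `swh:1:cnt:` scheme and the `lines=` qualifier from the SWHID specification; the helper names `swhid_for_content` and `with_lines` are hypothetical, and the post does not say how the tool itself builds its permalinks.

```python
import hashlib


def swhid_for_content(data: bytes) -> str:
    """Compute the content SWHID for raw file bytes.

    The hash is SHA-1 over a Git-style header ("blob <length>\\0")
    followed by the content, i.e. the Git blob object id.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def with_lines(swhid: str, start: int, end=None) -> str:
    """Append a `lines=` qualifier pointing at a line or a line range."""
    line_range = f"{start}-{end}" if end is not None else str(start)
    return f"{swhid};lines={line_range}"


if __name__ == "__main__":
    # The empty file has a well-known Git blob id.
    print(swhid_for_content(b""))
    # -> swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
    print(with_lines(swhid_for_content(b"<?php\necho 'hi';\n"), 2))
```

Because the identifier depends only on the content, it stays valid when a repository is moved or deleted, as long as the archive holds a copy of those bytes.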

Through this collaboration, we are helping to preserve software as a transparent and inclusive public good. For more on SWH’s vision, you can read my February Diff post regarding their 10-year journey at UNESCO.

Polyglot UI: why the interface speaks Telugu, Italian and French

But let’s talk about the frontend! When you open the tool, the first thing you’ll notice is the user interface (UI). Why does it speak Telugu, Italian, and French (and 13 other Indic languages)? OK, let me tell you a story.

How come Italian? Well, that’s my native language. But there’s a deeper curiosity. In the 15th century, the Venetian explorer Nicolò de’ Conti noticed that Telugu words often end with vowels, just like Italian. Thus, he called Telugu the “Italian of the East”. So (perhaps!) Italian is the “Telugu of the West”.

The 15th-century Venetian map inspired by Nicolò de’ Conti’s explorations (south at the top!) showing South Asian marvels in Taprobana (Sri Lanka), Bangala (Bengal), Odișa (Odisha), and the Andaman isles.

Just as mediaeval maps (like the one by Fra Mauro, in Venice) brought distant worlds onto the same parchment, this tool tries to bring developers across natural and programming languages closer together.

(The colours of the interface are just vector geometry; unrelated to Italy and India, though…)

French is a tribute to the birthplace of the SWH archive: Paris.

So the interface is translated because code development should reflect the diversity of our software communities and not require English fluency.

Try it, and share your thoughts.

Try it out: https://code2codesearch.toolforge.org/

Also, check out the page on MediaWiki. The tool has just been released and indexes 1,100,000+ code snippets from 83,000+ source code files across 2,400+ source repositories. The index runs on FAISS, a library from Facebook Research (now Meta); for every single result, code permalinks are served via Software Heritage (SWH).

There is always room for improvement. One main limitation is that the index is not updated dynamically when new code is committed to the original repositories; in its current version, MediaWiki Code2Code Search indexes the main branches as they stood at the beginning of April 2026. Further, some less common programming languages may be missing. Currently supported languages are Python, C++, C, PHP, JavaScript, TypeScript, Lua, Go, Java, and Rust.

If you try it and find something odd, or brilliant, or both, please get in touch! Open an issue in the GitHub repository, leave a message on my User talk page on Meta-Wiki, or reply to this post in the Diff comments section. Try searching for a function you wrote, say, six months ago and forgot the name of. Paste a 5-line snippet. Then tell me: did it find what you were looking for? I’d love to hear from fellow Wikimedians, especially technicians, those working on Indic-language projects, or anyone who has ever struggled to find the right piece of code. Some members of the Indic tech community are suggesting a transition to a simpler, more accessible Codex UI instead of the modern-looking Three.js + React stack: let’s talk about this!

I contributed to this project partly during my working hours, in my capacity as a researcher at Sant’Anna School of Advanced Studies, Pisa; in this capacity, I am funded by the Alfred P. Sloan Foundation with grant #G-2025-25193 (sloan.org). I thank my colleagues in Pisa and all Wikimedians in Telegram groups such as Indic Wikimedia technical forum, Wikidata Help, Wikimedia Hackathon, and Comunità tecnica Wikimedia in Italiano for insightful discussions and comments on this work.

The new MediaWiki Code2Code Search complements the existing Codesearch rather than replacing it. The old Codesearch is still there, and it’s still great. But if you ever wished you could search for what code does, not just what it says – give this a spin. The tool is live. Everything is open (released under the Apache License 2.0). And the vector dots are waiting for you.
