Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihvvz4nhacfsjych5ue2vwsh2scsbsnqdm6sm7ozuzpmakmqw2vsa",
    "uri": "at://did:plc:jo3wjj2gx46alocis4wubmwr/app.bsky.feed.post/3mjhfxm3ms2q2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreifddyx6tmhcpgjjgxx5wnh3igze53lpnwfrmbvyxi2qgathuh5dbu"
    },
    "mimeType": "image/png",
    "size": 428577
  },
  "path": "/2026/04/14/introducing-mediawiki-code2code-search-semantic-search-to-find-code-by-under-the-surface-similarity/",
  "publishedAt": "2026-04-14T11:00:00.000Z",
  "site": "https://diff.wikimedia.org",
  "tags": [
    "MediaWiki Codesearch",
    "MediaWiki Code2Code Search",
    "Jina AI bi-encoder model",
    "Wikidata Embedding Project",
    "hybrid search models",
    "Software Heritage",
    "SWHIDs",
    "my February Diff post regarding their 10-year journey at UNESCO",
    "he called Telugu the “Italian of the East”",
    "the one by Fra Mauro, in Venice",
    "Italy",
    "India",
    "https://code2codesearch.toolforge.org/",
    "page on MediaWiki",
    "FAISS",
    "Open an issue in the GitHub repository",
    "User talk page on Meta-Wiki",
    "Codex UI",
    "G-2025-25193"
  ],
  "textContent": "The Telugu user interface of MediaWiki Code2Code Search\n\nHave you ever tried to find a specific function in a MediaWiki extension but only vaguely remembered what it does, not its name? If so, a new tool might help. It’s called MediaWiki Code2Code Search, and it’s a bit different from what we’ve had before. It uses semantic search – think “search by meaning, not just by exact wording” – to help you find code snippets across our repositories by representing them as mathematical vectors.\n\nIt’s called “Code2Code” because you search with a piece of code to find other code: no need to write a natural language query. You paste a piece of code that captures some behaviour, and the tool attempts to find similar code across MediaWiki core and its extensions, even if the programming language, coding style, variable names or exact syntax differ.\n\nThis post is for MediaWiki extension developers, gadget authors, and anyone who navigates our shared codebase. Possible use cases include code maintenance, tracing code vulnerabilities and detecting plagiarism.\n\nThat’s the idea. Now let’s look under the hood.\n\nDemo: searching for code similar to a Python recursive greatest common divisor implementation.\n\n## So what does it actually do?\n\nOffline indexing (stages 1′–4′) meets online searching (stages 1–5).\n\nThe existing MediaWiki Codesearch (which won the 2019 Coolest Tool Award and is still relied on by many – with good reason!) is fast and reliable. It uses classic pattern-matching to find exact text. It’s like a very precise, very fast pair of eyes.\n\nConversely, the new MediaWiki Code2Code Search tries to understand what you mean, even if you don’t type the exact words or symbols. It turns code snippets into “embeddings” – mathematical vectors in a high-dimensional space. Similar code ends up close together in the vector space. 
The background animation, with swirling coloured particles, is an artistic representation of this mathematical transformation: each dot represents a code embedding, but simplified for the eye. Close vectors are similar in meaning, yet distinct in origin.\n\nCode2Code Search leverages a Jina AI bi-encoder model. Wikimedia Deutschland (WMDE) had a positive experience with Jina AI during the Wikidata Embedding Project, so I built on that. Since late 2025, WMF has also been developing hybrid search models (and, as of March 2026, is piloting them live on several Wikipedia editions), combining traditional keyword search with semantic-embedding-based retrieval.\n\nEvery component of MediaWiki Code2Code Search is open source. The tool uses a single-stage retrieval strategy: the bi-encoder model from Jina AI (with 500 million parameters) computes offline vector representations for code snippets; then, the same model converts the query into a vector online, and a vector index finds the most relevant code snippets from the entire indexed codebase. The model runs within Toolforge’s default constraints and returns results ranked by semantic similarity.\n\n## Archiving the MediaWiki codebase to Software Heritage\n\nLike dandelion seeds, open-source code can scatter and be lost. Archiving it gives it roots.\n\nAs Wikimedians, we know knowledge requires careful stewardship. Beyond just finding code, we must ensure the codebase remains traceable.\n\nTo this end, I have archived over 2,400 MediaWiki code repositories in Software Heritage (SWH). As a UNESCO-backed digital public good, SWH mirrors Wikimedia’s commitment to free, equitable knowledge. 
By providing a unified data model across forges (GitLab, GitHub, Gerrit, Bitbucket), SWH offers SWHIDs: intrinsic permalinks that ensure a specific line of code remains accessible forever, even if the original repository is moved or deleted.\n\nThrough this collaboration, we are helping to preserve software as a transparent and inclusive public good. For more on SWH’s vision, you can read my February Diff post regarding their 10-year journey at UNESCO.\n\n## Polyglot UI: why the interface speaks Telugu, Italian and French\n\nBut let’s talk about the frontend! When you open the tool, the first thing you’ll notice is the user interface (UI). Why does it speak Telugu, Italian, and French (and 13 other Indic languages)? OK, let me tell you a story.\n\nHow come Italian? Well, that’s my native language. But there’s a deeper curiosity. In the 15th century, the Venetian explorer Nicolò de’ Conti noticed that Telugu words often end with vowels, just like Italian. Thus, he called Telugu the “Italian of the East”. So (perhaps!) Italian is the “Telugu of the West”.\n\nThe 15th-century Venetian map inspired by Nicolò de’ Conti’s explorations (south at the top!) showing South Asian marvels in Taprobana (Sri Lanka), Bangala (Bengal), Odișa (Odisha), and the Andaman isles.\n\nJust as mediaeval maps (like the one by Fra Mauro, in Venice) brought distant worlds onto the same parchment, this tool tries to bring developers across natural and programming languages closer together.\n\n(The colours of the interface are just vector geometry; unrelated to Italy and India, though…)\n\nFrench is a tribute to the birthplace of the SWH archive: Paris.\n\nSo the interface is translated because code development should reflect the diversity of our software communities and not require English fluency.\n\n## Try it, and share your thoughts.\n\nTry it out: https://code2codesearch.toolforge.org/\n\nAlso, check out the page on MediaWiki. 
The tool has just been released and indexes 1,100,000+ code snippets from 83,000+ source code files across 2,400+ source repositories. The index runs on FAISS, a library from Facebook Research (now Meta); for every single result, code permalinks are served via Software Heritage (SWH).\n\nThere is always room for improvement. One main limitation is that the index does not receive dynamic updates when new code versions are committed to the original repositories; in its current version, MediaWiki Code2Code Search indexes the code versions in main branches as they appeared at the beginning of April 2026. Further, some less common programming languages may be missing. Currently supported languages are Python, C++, C, PHP, JavaScript, TypeScript, Lua, Go, Java, and Rust.\n\nIf you try it and find something odd, or brilliant, or both, please get in touch! Open an issue in the GitHub repository, leave a message on my User talk page on Meta-Wiki, or reply to this post in the Diff comments section. Try searching for a function you wrote, say, six months ago and forgot the name of. Paste a 5-line snippet. Then tell me: did it find what you were looking for? I’d love to hear from fellow Wikimedians, especially technicians, those working on Indic-language projects, or anyone who has ever struggled to find the right piece of code. Some members of the Indic tech community are suggesting a transition to a simpler, more accessible Codex UI instead of the modern-looking Three.js + React stack: let’s talk about this!\n\nI contributed to this project partly during my working hours in my capacity as a researcher at Sant’Anna School of Advanced Studies, Pisa; in this capacity, I am funded by the Alfred P. Sloan Foundation with grant #G-2025-25193 (sloan.org). 
I thank my colleagues in Pisa and all Wikimedians in Telegram groups such as _Indic Wikimedia technical forum_, _Wikidata Help_, _Wikimedia Hackathon_, and _Comunità tecnica Wikimedia in Italiano_ for insightful discussions and comments on this work.\n\nThe new MediaWiki Code2Code Search complements the existing Codesearch rather than replacing it. The old Codesearch is still there, and it’s still great. But if you ever wished you could search for what code does, not just what it says – give this a spin.\nThe tool is live. Everything is open (released under the Apache License 2.0). And the vector dots are waiting for you.",
  "title": "Introducing MediaWiki Code2Code Search: Semantic search to find code by under-the-surface similarity"
}