{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreib3lxhkoo464uezis7z6pa6p742ehg4on3dnd54rbozdpraukhzlq",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mok6qmjxkhb2"
},
"path": "/t/panetrans-drag-select-region-translation-ocr-on-video-canvas-built-on-transformers-js-local-by-default/176929#post_1",
"publishedAt": "2026-06-18T03:17:17.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"Transformers.js",
"https://chromewebstore.google.com/detail/iienfgpjfginkjecmdfeakkfmlibfdcj"
],
"textContent": "Hi all — solo dev here, sharing a Chrome extension I’ve been building and would genuinely value this community’s engineering feedback on, since it lives entirely on Transformers.js + WebGPU.\n\n**The wedge (what it actually does differently)**\n\nIt’s _not_ a full-page translator — it doesn’t try to compete with Immersive Translate and friends. Two things drove the whole design.\n\n**Drag-select any region of a page** → you get a frosted-glass overlay pinned over that rectangle with the translation. The interesting part: when the underlying text _changes_ (live chat, a stock ticker, video captions), the overlay re-translates itself, so it works on things that are never static.\n\n**OCR mode for text the DOM can’t give you** — video frames, images, `<canvas>`, web-game / web-app UIs. Non-selectable text that every selection-based translator simply can’t touch.\n\nBy **default everything runs locally in the browser** — no account, no API key, nothing leaves the tab, and it works offline once the model is cached. (Honest caveat up front: first use downloads the model once, which is slow; after that it’s cached and offline. And there’s one optional cloud tier — see the note further down — that’s the _only_ path where text would leave the browser, and only if you deliberately switch to it.)\n\n**How it’s built (the parts I’d love eyes on)**\n\n**Inference lives in an MV3 offscreen document.** Content scripts can’t hold WebGPU / long-lived model state sanely, and the service worker gets killed, so the offscreen doc is the one persistent place to load the model and run inference; content script ↔ offscreen messaging carries the text in and the translation out.\n\n**Models, all via Transformers.js:** OPUS-MT (~160MB/pair) is the default — I run it **fp16 + greedy on WebGPU**, which in my testing is fast enough to _feel_ real-time for the live-region case (that’s an experiential claim, not a benchmark). 
NLLB-200 (~400MB) and M2M-100 are also available for better one-shot quality at lower speed. **dtype selection matters a lot here**: fp16 on WebGPU vs the q4 quants for the larger models is most of the speed/quality/VRAM tradeoff, and WASM fp32 is the fallback when WebGPU isn’t there.\n\n**The live-region pipeline is a settle/supersede loop.** A `MutationObserver` watches the captured region; instead of translating every mutation, I debounce until the DOM _settles_, then fire. Because translations are async and can return out of order, each request carries a token and a newer request **supersedes** any in-flight older one — the stale result is dropped on arrival so a slow earlier translation can’t overwrite a fresh one. Getting this right (and not thrashing the model on chat that never stops mutating) was most of the work.\n\n**OCR hand-off:** Tesseract.js (local WASM) does recognition on a captured frame, and the recognized text is then handed to the same translation path as everything else — OCR and DOM text converge on one offscreen translate call. OCR is **not** 100%: stylized fonts and low-contrast frames fail, and I’d rather say that than oversell it.\n\nThere’s also a YouTube bilingual subtitle overlay built on the same plumbing.\n\n**Credit where it’s due:** none of this exists without Transformers.js and the work porting these models to run in-browser on ONNX Runtime Web. This community’s ports — OPUS-MT, NLLB, M2M-100, and the small LLMs — are the whole foundation.\n\nThere’s also an optional Pro tier with **local LLMs** (Qwen 3 0.6B, Gemma 3 1B, q4, WebGPU-only) for more context-aware output — though honestly they’re _not_ uniformly better than OPUS-MT, and they’re seconds-per-sentence, not real-time. 
And there is one optional **cloud tier** (a hosted LLM) for weak / no-GPU devices — **local is the default and the free path; nothing is sent to a server unless you deliberately switch to that tier.**\n\n**Where it’s at:** very early and honest about it — ~73 cumulative installs in the first month, single-digit weekly actives, 0 reviews. I’m not here for momentum, I’m here for the engineering critique. Languages currently: en / zh / ja / ko / fr / de / es / ru / pt / ar plus some non-English pairs.\n\nListing: https://chromewebstore.google.com/detail/iienfgpjfginkjecmdfeakkfmlibfdcj\n\n**Real questions I’m stuck on / curious about:**\n\n**Q1.** For the live-region case, is **fp16 + greedy OPUS-MT** the right default, or would a small quantized NLLB on WebGPU actually give better quality-per-millisecond once you account for the settle-debounce latency budget? Has anyone benchmarked OPUS-MT vs quantized NLLB-200 throughput on Transformers.js specifically?\n\n**Q2.** Is keeping the model resident in a single **MV3 offscreen document** the pattern people have settled on, or is there a better way to persist WebGPU + model state across MV3’s service-worker lifecycle that I’m missing?\n\n**Q3.** For OCR → translate, is there appetite for a Transformers.js-native OCR model (e.g. a TrOCR / Florence-style port) to replace Tesseract.js WASM, and has anyone gotten one fast enough for per-frame capture in-browser?\n\nThanks for reading — happy to go deeper on any of the internals.",
"title": "PaneTrans — drag-select region translation + OCR on video/canvas, built on Transformers.js (local by default)"
}