PaneTrans — drag-select region translation + OCR on video/canvas, built on Transformers.js (local by default)
Hi all, solo dev here. I’m sharing a Chrome extension I’ve been building, and I’d genuinely value this community’s engineering feedback on it, since it is built entirely on Transformers.js + WebGPU.
The wedge (what it actually does differently)
It’s not a full-page translator — it doesn’t try to compete with Immersive Translate and friends. Two things drove the whole design.
Drag-select any region of a page → you get a frosted-glass overlay pinned over that rectangle with the translation. The interesting part: when the underlying text changes (live chat, a stock ticker, video captions), the overlay re-translates itself, so it works on things that are never static.
OCR mode for text the DOM can’t give you: video frames, images, <canvas>, web-game / web-app UIs. This is non-selectable text that selection-based translators can’t touch.
By default everything runs locally in the browser — no account, no API key, nothing leaves the tab, and it works offline once the model is cached. (Honest caveat up front: first use downloads the model once, which is slow; after that it’s cached and offline. And there’s one optional cloud tier — see the note further down — that’s the only path where text would leave the browser, and only if you deliberately switch to it.)
How it’s built (the parts I’d love eyes on)
Inference lives in an MV3 offscreen document. Content scripts can’t hold WebGPU / long-lived model state sanely, and the service worker gets killed, so the offscreen doc is the one persistent place to load the model and run inference; content script ↔ offscreen messaging carries the text in and the translation out.
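A minimal sketch of what the content script ↔ offscreen message contract could look like. The message shapes and function names here are hypothetical and are not taken from PaneTrans; the model call is injected so the handler stays independent of any particular pipeline.

```typescript
// Hypothetical message shapes for illustration only.
type TranslateRequest = {
  kind: "translate";
  id: number; // lets the sender match a response to its request
  text: string;
  src: string;
  tgt: string;
};

type TranslateResponse =
  | { kind: "result"; id: number; translation: string }
  | { kind: "error"; id: number; message: string };

type TranslateFn = (text: string, src: string, tgt: string) => Promise<string>;

// Type guard so the offscreen listener can ignore unrelated extension messages.
function isTranslateRequest(msg: unknown): msg is TranslateRequest {
  if (typeof msg !== "object" || msg === null) return false;
  const m = msg as Record<string, unknown>;
  return (
    m.kind === "translate" &&
    typeof m.id === "number" &&
    typeof m.text === "string" &&
    typeof m.src === "string" &&
    typeof m.tgt === "string"
  );
}

// Intended to run inside the offscreen document. Returns null for messages
// that are not translate requests, and an error response if the model throws.
async function handleMessage(
  msg: unknown,
  translate: TranslateFn,
): Promise<TranslateResponse | null> {
  if (!isTranslateRequest(msg)) return null;
  try {
    const translation = await translate(msg.text, msg.src, msg.tgt);
    return { kind: "result", id: msg.id, translation };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { kind: "error", id: msg.id, message };
  }
}
```

In an extension, a handler like this would typically be registered with chrome.runtime.onMessage.addListener in the offscreen page and called via chrome.runtime.sendMessage from the content script; that wiring is left out here so the sketch has no dependency on the chrome global.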
Models, all via Transformers.js: OPUS-MT (160MB/pair) is the default. I run it fp16 + greedy on WebGPU, which in my testing is fast enough to feel real-time for the live-region case (that’s an experiential claim, not a benchmark). NLLB-200 (400MB) and M2M-100 are also available for better one-shot quality at lower speed. dtype selection matters a lot here: fp16 on WebGPU vs the q4 quants for the larger models is most of the speed/quality/VRAM tradeoff, and WASM fp32 is the fallback when WebGPU isn’t there.
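The dtype/device choice described above can be sketched as a small pure function. The mapping follows the post (fp16 for OPUS-MT on WebGPU, q4 for the larger models, fp32 on WASM as the fallback), but the type names, function names, and the exact per-model rule are assumptions for illustration.

```typescript
// Hypothetical names; the rule is a reading of the post, not PaneTrans's code.
type ModelFamily = "opus-mt" | "nllb-200" | "m2m-100";

type Backend = {
  device: "webgpu" | "wasm";
  dtype: "fp32" | "fp16" | "q4";
};

// Minimal structural type so this can be tested without a browser.
type NavigatorLike = {
  gpu?: { requestAdapter(): Promise<unknown> };
};

// WebGPU counts as available only if an adapter can actually be obtained;
// requestAdapter() may resolve to null on unsupported hardware.
async function detectWebGPU(nav: NavigatorLike): Promise<boolean> {
  if (!nav.gpu) return false;
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null && adapter !== undefined;
  } catch {
    return false;
  }
}

function pickBackend(family: ModelFamily, hasWebGPU: boolean): Backend {
  // No WebGPU: fall back to WASM with fp32 weights.
  if (!hasWebGPU) return { device: "wasm", dtype: "fp32" };
  // Small per-pair model: fp16 on WebGPU.
  if (family === "opus-mt") return { device: "webgpu", dtype: "fp16" };
  // Larger multilingual models: q4 quantization.
  return { device: "webgpu", dtype: "q4" };
}

// Usage inside the offscreen document (not executed here; the model id is a placeholder):
//   import { pipeline } from "@huggingface/transformers";
//   const hasWebGPU = await detectWebGPU(navigator);
//   const translator = await pipeline("translation", "<model-id>", pickBackend("opus-mt", hasWebGPU));
```

Keeping the decision in one function makes it easy to log which backend was chosen on a given machine, which helps when comparing speed reports from different users.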
The live-region pipeline is a settle/supersede loop. A MutationObserver watches the captured region; instead of translating every mutation, I debounce until the DOM settles, then fire. Because translations are async and can return out of order, each request carries a token and a newer request supersedes any in-flight older one: the stale result is dropped on arrival so a slow earlier translation can’t overwrite a fresh one. Getting this right (and not thrashing the model on chat that never stops mutating) was most of the work.
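A minimal sketch of a settle/supersede loop as described above: debounce mutations, tag each request with a monotonically increasing token, and drop any result whose token is no longer the latest. The class name, method names, and the 300 ms settle window are assumptions, not PaneTrans’s actual implementation.

```typescript
class SettleSupersede {
  private timer: ReturnType<typeof setTimeout> | undefined;
  private latest = 0;
  private readonly translate: (text: string) => Promise<string>;
  private readonly render: (translation: string) => void;
  private readonly settleMs: number;

  constructor(
    translate: (text: string) => Promise<string>,
    render: (translation: string) => void,
    settleMs = 300, // assumed settle window
  ) {
    this.translate = translate;
    this.render = render;
    this.settleMs = settleMs;
  }

  // Issue a new token; every older token becomes stale.
  begin(): number {
    this.latest += 1;
    return this.latest;
  }

  isCurrent(token: number): boolean {
    return token === this.latest;
  }

  // Call on every mutation with the region's current text.
  // Restarts the settle timer so only the last text in a burst is translated.
  onMutation(text: string): void {
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = setTimeout(() => {
      this.timer = undefined;
      void this.fire(text);
    }, this.settleMs);
  }

  // Translate and render only if no newer request started in the meantime.
  async fire(text: string): Promise<void> {
    const token = this.begin();
    let result: string;
    try {
      result = await this.translate(text);
    } catch {
      return; // a failed request never overwrites the overlay
    }
    if (!this.isCurrent(token)) return; // superseded: drop the stale result
    this.render(result);
  }
}

// Browser wiring (not executed here; `region` and `overlay` are hypothetical elements):
//   const loop = new SettleSupersede(translateViaOffscreen, (t) => { overlay.textContent = t; });
//   new MutationObserver(() => loop.onMutation(region.textContent ?? ""))
//     .observe(region, { childList: true, characterData: true, subtree: true });
```

One caveat with a plain debounce like this: a region that never goes quiet never fires. A maximum-wait cap is one common way to handle that, though the post does not say which approach PaneTrans takes.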
OCR hand-off: Tesseract.js (local WASM) does recognition on a captured frame, and the recognized text is then handed to the same translation path as everything else, so OCR and DOM text converge on one offscreen translate call. OCR is not 100% accurate: stylized fonts and low-contrast frames fail, and I’d rather say that than oversell it.
There’s also a YouTube bilingual subtitle overlay built on the same plumbing.
Credit where it’s due: none of this exists without Transformers.js and the work porting these models to run in-browser on ONNX Runtime Web. This community’s ports — OPUS-MT, NLLB, M2M-100, and the small LLMs — are the whole foundation.
There’s also an optional Pro tier with local LLMs (Qwen 3 0.6B, Gemma 3 1B, q4, WebGPU-only) for more context-aware output — though honestly they’re not uniformly better than OPUS-MT, and they’re seconds-per-sentence, not real-time. And there is one optional cloud tier (a hosted LLM) for weak / no-GPU devices — local is the default and the free path; nothing is sent to a server unless you deliberately switch to that tier.
Where it’s at: very early and honest about it — ~73 cumulative installs in the first month, single-digit weekly actives, 0 reviews. I’m not here for momentum, I’m here for the engineering critique. Languages currently: en / zh / ja / ko / fr / de / es / ru / pt / ar plus some non-English pairs.
Listing: https://chromewebstore.google.com/detail/iienfgpjfginkjecmdfeakkfmlibfdcj
Real questions I’m stuck on / curious about:
Q1. For the live-region case, is fp16 + greedy OPUS-MT the right default, or would a small quantized NLLB on WebGPU actually give better quality-per-millisecond once you account for the settle-debounce latency budget? Has anyone benchmarked OPUS-MT vs quantized NLLB-200 throughput on Transformers.js specifically?
Q2. Is keeping the model resident in a single MV3 offscreen document the pattern people have settled on, or is there a better way to persist WebGPU + model state across MV3’s service-worker lifecycle that I’m missing?
Q3. For OCR → translate, is there appetite for a Transformers.js-native OCR model (e.g. a TrOCR / Florence-style port) to replace Tesseract.js WASM, and has anyone gotten one fast enough for per-frame capture in-browser?
Thanks for reading — happy to go deeper on any of the internals.