Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreigzzvzwy25eqgpughfu4ypudlczxfxwespr4rtnxtbdka3sxw2nua",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3momvysuuzra2"
  },
  "path": "/t/panetrans-drag-select-region-translation-ocr-on-video-canvas-built-on-transformers-js-local-by-default/176929#post_2",
  "publishedAt": "2026-06-19T08:13:16.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "How to Use Transformers.js in a Chrome Extension",
    "Tesseract.js performance docs",
    "API docs",
    "NLLB-200 distilled 600M model card",
    "NLLB paper",
    "OPUS-MT project",
    "Democratizing Machine Translation with OPUS-MT",
    "Transformers.js issue #1317",
    "Transformers.js issue #1518",
    "The extension service worker lifecycle",
    "Offscreen API",
    "Gemma Gem",
    "Tesseract.js API docs",
    "workers_vs_schedulers.md",
    "chrome.tabs API",
    "chrome.tabCapture API",
    "Use cross-origin images in a canvas",
    "Transformers.js issue #1385: Add VideoFrame support to RawImage",
    "Transformers.js issue #1248: Chrome extension fails due to remotely hosted code",
    "Deal with remote hosted code violations",
    "Transformers.js env API",
    "Transformers.js custom usage",
    "Transformers.js dtype guide",
    "Limited Use policy",
    "User data FAQ"
  ],
  "textContent": "I’m not very familiar with JavaScript, but after doing a bit of digging, my rough understanding is:\n\n* * *\n\nThis looks like a reasonable direction to me. I would probably frame PaneTrans less as a pure “which translation model is best?” question and more as a **live browser-extension pipeline** question.\n\nMy rough answer to the three questions would be:\n\nTopic | My rough take\n---|---\n**OPUS-MT vs quantized NLLB** | I would keep **OPUS-MT fp16 + greedy** as the default live/hot path for now, and treat **NLLB** as an opt-in coverage/quality path unless current Transformers.js v4 benchmarks show otherwise.\n**Offscreen document vs service worker** | I would not call the offscreen-document approach obsolete. The newer Hugging Face guide, How to Use Transformers.js in a Chrome Extension, makes the background service-worker model-host pattern worth comparing against, but PaneTrans has a more latency-sensitive OCR/overlay loop than a normal assistant extension.\n**Transformers.js-native OCR** | Interesting later, but I would first harden the OCR/capture scheduling layer before replacing Tesseract.js. The Tesseract.js performance docs and API docs are useful references for worker reuse and schedulers.\n**Most valuable next work** | Probably pipeline hardening: ROI-only OCR, stale-job cancellation, bounded OCR queue, cache/offline state, extension-local JS/WASM packaging, and a clear local/cloud privacy story.\n\n## 1. 
I would separate the problem into layers\n\nThe project becomes easier to reason about if the architecture is split into layers:\n\nLayer | Main question\n---|---\n**Text source layer** | Is the text available from the DOM, or does it require visual capture/OCR?\n**Capture layer** | Is this a static screenshot-like case, or a continuous video/canvas/game/subtitle case?\n**OCR layer** | How often should OCR run, and how are old OCR jobs discarded?\n**Translation layer** | Which translation model is good enough for low-latency live updates?\n**Runtime layer** | Should warm OCR/model state live in the service worker, an offscreen document, or a hybrid?\n**Cache/offline layer** | What happens on first run, offline use, broken cache, or model update?\n**Store/privacy layer** | Can the data flow be explained as local-first, user-triggered, and selected-region-only?\n\nThat framing prevents the model choice from carrying too much architectural weight. For live overlays, the visible user experience depends on the whole chain:\n\n\n    source text / visual region changes\n    → DOM extraction or visual capture\n    → OCR, if needed\n    → translation\n    → stale-result filtering\n    → overlay update\n\n\nSo I would measure and optimize the **end-to-end overlay path**, not only model throughput.\n\n## 2. Q1: OPUS-MT vs quantized NLLB\n\nI could not find a clean public benchmark of **OPUS-MT vs quantized NLLB-200 specifically on current Transformers.js v4/WebGPU**. 
So I would avoid making a global claim like “OPUS-MT is better” or “NLLB is better.”\n\nThe indirect evidence suggests a split design.\n\nRole | OPUS-MT / Marian-style models | NLLB-200 distilled\n---|---|---\nNatural role | Default live hot path | Opt-in coverage/quality mode\nModel shape | Pair-specific translation models | Broad multilingual model\nStrength | Smaller, simpler, easier to reason about per language pair | Wide coverage, especially useful for low-resource language support\nBrowser risk | Still needs testing, but likely easier to keep lightweight | More likely to stress load time, memory, dtype/backend stability\nBest decision criterion | “Good enough quality + low latency” | “Extra coverage/quality worth the cost”\n\nFor NLLB, the main argument is not necessarily browser live latency. It is multilingual coverage. The NLLB-200 distilled 600M model card and the NLLB paper make the coverage and low-resource language story clear. That is valuable, but it is a different optimization target from rapidly changing chat/ticker/subtitle overlays.\n\nFor OPUS-MT, the OPUS-MT project and the OPUS/Marian ecosystem are more pair-specific. That makes it easier to choose and evaluate per language pair. The OPUS-MT paper also discusses compact and speed-oriented translation models for real-time use cases: Democratizing Machine Translation with OPUS-MT.\n\nThere are also some caution signs around quantized decoder paths in browser/WebGPU settings. For example, Transformers.js issue #1317 reports q8 decoder output problems on WebGPU, and Transformers.js issue #1518 reports a WebGPU translation crash path around NLLB/translation.\n\nThose issues do not prove that NLLB is a bad choice. They only suggest that quantized NLLB/WebGPU should be benchmarked as a real pipeline before becoming the default hot path.\n\nSo my rough recommendation would be:\n\n> Keep OPUS-MT fp16 + greedy as the default live-region path. 
Treat NLLB as an opt-in coverage/quality path, especially for language pairs where OPUS-MT quality or availability is not good enough. Promote NLLB to the hot path only after a current Transformers.js v4 benchmark shows it is faster, stable, and good enough for the exact language pairs.\n\n### Benchmarking Q1\n\nI would not only measure `tokens/sec`. A live overlay benchmark probably needs:\n\nMetric | Why it matters\n---|---\n**Cold load time** | First-use friction\n**Download size** | User adoption and cache reliability\n**Warm p50 latency** | Normal overlay responsiveness\n**Warm p95 latency** | Visible lag spikes\n**Memory / GPU failure rate** | Browser/WebGPU stability\n**Stale-result rate** | Whether old translations overwrite newer text\n**Language-pair quality** | Whether NLLB’s coverage advantage matters for the target pair\n**Offline restart test** | Whether cached models actually work when offline\n\nA practical benchmark set might look like:\n\n\n    Models:\n    - OPUS-MT fp16 WebGPU, where available\n    - OPUS-MT q8/q4/WASM, if useful and available\n    - NLLB q8/q4/q4f16/WebGPU, if available\n    - Optional: cloud/API baseline only for quality comparison, not default local path\n\n    Inputs:\n    - 20 short chat-like strings\n    - 20 ticker/UI-like strings\n    - 20 subtitle-like strings\n    - 5 language pairs, including at least one low-resource pair if that is a target\n\n    Report:\n    - cold load\n    - warm p50/p95\n    - memory\n    - failures\n    - visible update latency\n    - translation quality notes\n\n\nThe important number is probably not model throughput alone, but:\n\n\n    source text changes\n    → OCR/DOM path notices it\n    → translation runs\n    → overlay updates\n\n\nThat is the user-visible latency.\n\n## 3. Q2: offscreen document vs service worker\n\nI would separate two questions:\n\n  1. **Can WebGPU run in a service worker now?**\n  2. 
**Is the MV3 service worker the best lifecycle container for a warm, resident OCR/translation engine?**\n\n\n\nThe first answer is increasingly “yes.” Hugging Face now has a very relevant guide: How to Use Transformers.js in a Chrome Extension. That guide is explicitly about using Transformers.js under Chrome Manifest V3 constraints, and it describes an architecture with:\n\n  * a **background service worker** that hosts models,\n  * a **side panel** for chat UI,\n  * a **content script** for page-level actions.\n\n\n\nThat is now a strong reference architecture for Transformers.js Chrome extensions.\n\nBut the second question is more subtle. Chrome extension service workers are not persistent Manifest V2 background pages. The Chrome docs explain that extension service workers can be terminated after inactivity or long-running work: The extension service worker lifecycle.\n\nSo I would not read the HF guide as “offscreen is obsolete.” I would read it as:\n\n> The service-worker-hosted model pattern is now a serious reference architecture for Transformers.js Chrome extensions.\n\nPaneTrans has a different hot path from a normal assistant extension. A chat assistant can often tolerate recoverable model state. A live overlay translator may care more about warm latency, OCR loops, frame scheduling, long-running visual capture, and avoiding stale overlay updates.\n\nThe offscreen document still seems defensible for this kind of workload, especially if it is used as a dedicated helper for DOM/canvas/media/OCR/model work. Chrome’s Offscreen API describes it as a hidden document for cases where a service worker needs DOM-like capabilities. 
There is also a related example where the offscreen document hosts a Transformers.js/WebGPU model while the service worker routes messages: Gemma Gem.\n\nI would phrase the architecture choice as a comparison, not a correction:\n\nOption | Good for | Main caution\n---|---|---\n**Service worker model host** | Aligns with the HF extension guide; good central coordinator | Must treat model state as recoverable; lifecycle can interrupt warm state\n**Offscreen model/OCR helper** | Better fit for resident media/OCR/canvas-style work | Should not become an unstructured “second background page”; keep message boundaries clear\n**Hybrid** | Service worker handles permissions/routing/cache state; offscreen handles capture/OCR/model loop | More moving parts, but likely clearer for PaneTrans-like workloads\n\nIf the current implementation already has a clean offscreen model/OCR worker, I would treat this as hardening rather than a rewrite.\n\nIf the current implementation puts most OCR/capture/model work directly in the content script, or relies on frequent screenshots, then moving toward a clearer service-worker/offscreen split may be a larger change.\n\nA possible responsibility split:\n\nComponent | Possible responsibility\n---|---\n**Content script** | Region selection, DOM text extraction, overlay positioning, MutationObserver\n**Service worker** | Permissions, tab routing, tabCapture stream ID, cache/model status, mode switching\n**Offscreen document** | Media/canvas processing, OCR worker pool, optional warm model host\n**Side panel / popup** | Language settings, local/cloud mode, cache status, download progress, privacy state\n\nThis is not the only architecture, but it is a useful mental model.\n\n## 4. Q3: OCR direction\n\nFor OCR, I would first optimize the scheduling layer before replacing the OCR engine.\n\nTesseract.js is not necessarily glamorous, but it is a practical browser OCR baseline. 
The important part is not to call OCR too often and not to create/destroy workers unnecessarily. The Tesseract.js performance docs warn against arbitrary worker creation and recommend reusing workers or using a bounded scheduler/pool for parallel work. The scheduler API is documented here: Tesseract.js API docs. A conceptual comparison of workers and schedulers is also in workers_vs_schedulers.md.\n\nFor PaneTrans, I would probably think about the OCR loop like this:\n\n\n    selected ROI\n    → crop\n    → downscale if useful\n    → cheap frame diff / visual-change check\n    → debounce\n    → bounded OCR queue\n    → Tesseract worker or scheduler\n    → OCR text changed?\n    → translation\n    → discard stale result if a newer ROI/text version exists\n\n\nThe key point is: do not OCR every frame just because the frame exists.\n\nA practical OCR hardening list:\n\nOCR hardening item | Why it helps\n---|---\n**ROI-only OCR** | Less work, clearer privacy story\n**Frame diff before OCR** | Avoids repeated OCR on unchanged regions\n**Debounce** | Avoids OCR during visual transitions\n**Bounded queue** | Prevents backlog when frames change faster than OCR completes\n**Sequence IDs** | Prevents old OCR/translation jobs from overwriting newer text\n**Worker reuse** | Avoids repeated Tesseract startup cost\n**Worker recycle** | Useful for long-running sessions if memory grows\n**Language-specific OCR config** | Avoids unnecessary OCR language packs\n\nTransformers.js-native OCR is still interesting. For example, TrOCR-like or PaddleOCR-like routes could become useful later. But I would not expect them to be a drop-in replacement for live video/canvas OCR unless detection, layout, cropping, size, initialization time, and browser memory are also handled.\n\nSo I would answer Q3 as:\n\n> Yes, I would be interested in Transformers.js-native OCR options, but I would first make the Tesseract/capture pipeline boringly reliable. 
The immediate wins are probably ROI scheduling, stale-job cancellation, worker reuse, and bounded queues, not swapping the OCR model.\n\n## 5. Capture/input pipeline\n\nI would probably separate input paths rather than treat all text sources the same:\n\nPath | Use case | Notes\n---|---|---\n**DOM text path** | Normal web pages, chat, tickers, dynamic DOM text | Best first path: fast, low permission burden, no OCR\n**Static screenshot path** | Occasional image/canvas text | Good fallback, but not ideal as a high-frequency loop\n**tabCapture/offscreen path** | Video, games, subtitles, frequently changing canvas | More complex, but likely better for continuous visual text\n\nFor screenshots, `captureVisibleTab()` can be useful, but I would avoid making it the high-frequency OCR loop. Chrome documents a maximum captureVisibleTab call rate, exposed as `MAX_CAPTURE_VISIBLE_TAB_CALLS_PER_SECOND`, in the chrome.tabs API.\n\nFor continuous media-like cases, `tabCapture` is probably worth considering. The chrome.tabCapture API covers capturing the current tab’s media stream, with user invocation constraints.\n\nA possible layered approach:\n\n\n    DOM text available?\n      → use DOM text path\n\n    visual static region?\n      → use low-frequency screenshot OCR fallback\n\n    continuous video/canvas/game/subtitle region?\n      → consider tabCapture + offscreen document\n\n\nThis also helps with Chrome Web Store review and user trust: process the selected region, not the whole page, and make the data flow easy to explain.\n\n## 6. Canvas/video edge cases\n\nOne easy trap is assuming that because something is visible, it is always directly readable as pixels from the page.\n\nFor example, if a page canvas contains cross-origin images or video without the right CORS conditions, the canvas can become tainted. Then APIs such as `getImageData()`, `toBlob()`, or `toDataURL()` may fail. 
MDN has a clear explanation here: Use cross-origin images in a canvas.\n\nSo I would design the visual-input path with fallback tiers:\n\n\n    1. DOM text path\n    2. direct element/canvas/image path, if safely readable\n    3. selected-region screenshot fallback\n    4. tabCapture/offscreen path for continuous visual regions\n\n\nFor video/canvas performance, another future optimization is avoiding unnecessary full-frame CPU readback. There is a related Transformers.js discussion/request around `VideoFrame` / WebCodecs / avoiding `getImageData()`-style paths here: Transformers.js issue #1385: Add VideoFrame support to RawImage.\n\nI would not make that a first rewrite requirement, but I would keep it in mind if the video OCR path becomes a bottleneck.\n\n## 7. Cache/offline/packaging reliability\n\nFor a local-first extension, I would separate three things:\n\n  1. **Model weights**\n  2. **Transformers.js runtime code**\n  3. **ONNX Runtime WASM/factory files**\n\n\n\nThe model weights can be downloaded and cached. But extension runtime code and WASM helper files need special care under Manifest V3.\n\nThere is a relevant Transformers.js issue where a Chrome extension hit remote-hosted-code problems while trying to load ORT helper files from a CDN: Transformers.js issue #1248: Chrome extension fails due to remotely hosted code. Chrome’s own documentation on remote hosted code explains why this matters for MV3 review: Deal with remote hosted code violations.\n\nSo I would make cache/offline state explicit in the UI:\n\nState | UI meaning\n---|---\n**Not downloaded** | First run required\n**Downloading** | Show progress\n**Cached** | Model files available\n**Offline-ready** | Model + runtime/WASM path usable without network\n**Missing file** | Cache incomplete\n**Redownload needed** | Version/cache mismatch\n**Clear cache** | User recovery path\n\nThe Transformers.js environment settings are also worth reviewing: Transformers.js env API. 
Depending on the final packaging, settings such as local model paths, remote model allowance, browser/WASM cache, custom fetch, and cache keys may matter. The general custom model usage docs are also relevant: Transformers.js custom usage. For dtype selection, the Transformers.js dtype guide is worth checking because available dtypes are model/repo dependent rather than universal.\n\nA conservative cache/offline design could be:\n\n\n    First run:\n      - show model choice\n      - show approximate download size\n      - download with progress\n      - verify required files are present\n      - run a tiny smoke test\n\n    Offline mode:\n      - do not silently attempt remote fetch\n      - show whether each model is offline-ready\n      - if a file is missing, explain that first download is required\n\n    Recovery:\n      - clear/redownload model cache\n      - version cache keys by model/revision\n      - handle partial cache or failed download visibly\n\n\nIn the HF Chrome-extension demo repo, the `ModelRegistry` style of checking needed files, metadata, size, and cache state is useful as a pattern. Even if PaneTrans does not copy that exact implementation, the product behavior is good: the user should know what is downloaded, what is cached, and what works offline.\n\n## 8. Chrome Web Store / privacy story\n\nI would treat the store/privacy story as part of the architecture, not as final paperwork.\n\nPaneTrans may touch page text, screenshots, OCR output, and translated text. Chrome’s Limited Use policy says web browsing activity collection/use should be limited to the user-facing feature and not repurposed. 
The User data FAQ also explains privacy policy and sensitive data expectations.\n\nA privacy-friendly implementation also helps performance:\n\nPrivacy-friendly choice | Performance benefit\n---|---\n**Process selected ROI only** | Less OCR work\n**Local-first by default** | Lower latency and clearer trust model\n**Do not store page screenshots** | Less storage/cache complexity\n**Drop intermediate frames** | Lower memory pressure\n**Separate local/cloud mode** | Clearer UX and easier policy explanation\n**Optional permissions where possible** | Lower permission friction\n\nI would probably describe the intended data flow explicitly:\n\n\n    - OCR runs only on the selected region.\n    - Local mode does not send page images/text to a server.\n    - Cloud mode, if enabled, is separate and explicit.\n    - Intermediate screenshots/frames are not retained.\n    - Model cache stores model/runtime files, not page content.\n\n\nThis is not just a policy matter. It is also a useful architecture constraint.\n\n## 9. 
Suggested hardening checklist\n\nIf I were hardening PaneTrans, I would probably prioritize this order:\n\nPriority | Item | Why\n---|---|---\n1 | Keep DOM text extraction as the first path | Fastest, least fragile, least permission-heavy\n2 | OCR only the selected ROI | Lower cost, clearer privacy story\n3 | Add sequence IDs to OCR and translation jobs | Prevent stale result overwrites\n4 | Drop stale OCR/translation results | Critical for live regions\n5 | Add debounce and frame-diff before OCR | Avoid unnecessary OCR\n6 | Reuse Tesseract workers | Avoid repeated initialization\n7 | Use a bounded scheduler if parallelism is needed | Avoid unbounded memory growth\n8 | Treat `captureVisibleTab()` as low-frequency fallback | Avoid high-frequency screenshot loop\n9 | Consider `tabCapture + offscreen` for continuous regions | Better fit for video/canvas/subtitle cases\n10 | Show model download/cache/offline status | Better first-run and failure recovery\n11 | Bundle/configure JS/WASM runtime files in an MV3-friendly way | Avoid remote-hosted-code issues\n12 | Keep local/cloud mode and privacy disclosure explicit | Easier user trust and store review\n13 | Benchmark OPUS-MT and NLLB as end-to-end overlay paths | Measures the real UX, not just model speed\n\n## 10. 
Suggested benchmark matrix\n\nFor the model/runtime question, I would not only measure tokens/sec.\n\nA more useful validation matrix might be:\n\nBenchmark item | Why\n---|---\n**Cold model load time** | First-use experience\n**Warm p50 latency** | Normal live overlay feel\n**Warm p95 latency** | Lag spikes\n**Download size** | User friction\n**Memory usage** | WebGPU/browser stability\n**OCR calls/min** | Whether scheduling is working\n**Stale-result drop rate** | Whether old results are safely discarded\n**Translation quality by language pair** | Whether NLLB’s coverage advantage matters\n**Offline restart test** | Whether cache/offline actually works\n**Long-session test** | Whether OCR workers/memory stay healthy\n**Store-review readiness** | Whether permissions/data flow are explainable\n\nI would especially measure “visible update latency”:\n\n\n    source text changes\n    → DOM/capture/OCR detects it\n    → translation runs\n    → stale result check passes\n    → overlay updates\n\n\nThat seems more relevant than model throughput alone.\n\n## 11. 
If the current implementation already does X…\n\nBecause I do not know the internal implementation, I would phrase implementation advice conditionally.\n\nIf current PaneTrans already does this… | Then I would treat the next work as…\n---|---\nContent script only handles selection/overlay/DOM extraction | Mostly hardening\nOffscreen document already hosts OCR/model work | Mostly hardening\nService worker already handles routing/permissions/cache state | Mostly hardening\nOCR already runs only on selected ROI | Mostly hardening\nModel/cache status is already visible | Mostly UX refinement\n\nIf current PaneTrans instead does this… | Then the change may be larger\n---|---\nContent script owns OCR, model inference, capture, and overlay all together | May need responsibility split\n`captureVisibleTab()` is used as a high-frequency loop | Consider screenshot fallback vs tabCapture path\nTesseract worker is created per frame/crop | Add worker reuse/scheduler\nNo stale-job cancellation | Add sequence IDs and stale-result drop\nJS/WASM helper files are fetched dynamically from CDN | Review MV3 packaging\nCache state is implicit | Add explicit cache/offline status and recovery UI\n\nThis keeps the suggestion non-critical: if the architecture is already close, it is not a rewrite; if not, the table explains where larger changes might be.\n\n## 12. Final rough recommendation\n\nPutting it all together:\n\n> I would keep the current direction, but frame the next step as pipeline hardening. OPUS-MT fp16 + greedy seems like the safer default live path for now; NLLB is a good opt-in coverage/quality path. The offscreen-document approach still looks defensible for a low-latency OCR/translation loop, even though the HF service-worker model-host pattern is now a strong reference architecture. For OCR, I would first optimize scheduling, worker reuse, bounded queues, and stale-result cancellation before replacing Tesseract.js. 
The most important reliability work may be around capture strategy, model/WASM packaging, cache/offline state, and a clear local-first privacy story.",
  "title": "PaneTrans — drag-select region translation + OCR on video/canvas, built on Transformers.js (local by default)"
}