
PaneTrans — drag-select region translation + OCR on video/canvas, built on Transformers.js (local by default)

Hugging Face Forums [Unofficial] June 19, 2026

I’m not very familiar with JavaScript, but after doing a bit of digging, my rough understanding is:


This looks like a reasonable direction to me. I would probably frame PaneTrans less as a pure “which translation model is best?” question and more as a live browser-extension pipeline question.

My rough answer to the three questions would be:

| Topic | My rough take |
| --- | --- |
| OPUS-MT vs quantized NLLB | I would keep OPUS-MT fp16 + greedy as the default live/hot path for now, and treat NLLB as an opt-in coverage/quality path unless current Transformers.js v4 benchmarks show otherwise. |
| Offscreen document vs service worker | I would not call the offscreen-document approach obsolete. The newer Hugging Face guide, How to Use Transformers.js in a Chrome Extension, makes the background service-worker model-host pattern worth comparing against, but PaneTrans has a more latency-sensitive OCR/overlay loop than a normal assistant extension. |
| Transformers.js-native OCR | Interesting later, but I would first harden the OCR/capture scheduling layer before replacing Tesseract.js. The Tesseract.js performance docs and API docs are useful references for worker reuse and schedulers. |
| Most valuable next work | Probably pipeline hardening: ROI-only OCR, stale-job cancellation, bounded OCR queue, cache/offline state, extension-local JS/WASM packaging, and a clear local/cloud privacy story. |

1. I would separate the problem into layers

The project becomes easier to reason about if the architecture is split into layers:

| Layer | Main question |
| --- | --- |
| Text source layer | Is the text available from the DOM, or does it require visual capture/OCR? |
| Capture layer | Is this a static screenshot-like case, or a continuous video/canvas/game/subtitle case? |
| OCR layer | How often should OCR run, and how are old OCR jobs discarded? |
| Translation layer | Which translation model is good enough for low-latency live updates? |
| Runtime layer | Should warm OCR/model state live in the service worker, an offscreen document, or a hybrid? |
| Cache/offline layer | What happens on first run, offline use, broken cache, or model update? |
| Store/privacy layer | Can the data flow be explained as local-first, user-triggered, and selected-region-only? |

That framing prevents the model choice from carrying too much architectural weight. For live overlays, the visible user experience depends on the whole chain:

source text / visual region changes
→ DOM extraction or visual capture
→ OCR, if needed
→ translation
→ stale-result filtering
→ overlay update

So I would measure and optimize the end-to-end overlay path, not only model throughput.

2. Q1: OPUS-MT vs quantized NLLB

I could not find a clean public benchmark of OPUS-MT vs quantized NLLB-200 specifically on current Transformers.js v4/WebGPU. So I would avoid making a global claim like “OPUS-MT is better” or “NLLB is better.”

The indirect evidence suggests a split design.

| Role | OPUS-MT / Marian-style models | NLLB-200 distilled |
| --- | --- | --- |
| Natural role | Default live hot path | Opt-in coverage/quality mode |
| Model shape | Pair-specific translation models | Broad multilingual model |
| Strength | Smaller, simpler, easier to reason about per language pair | Wide coverage, especially useful for low-resource language support |
| Browser risk | Still needs testing, but likely easier to keep lightweight | More likely to stress load time, memory, dtype/backend stability |
| Best decision criterion | “Good enough quality + low latency” | “Extra coverage/quality worth the cost” |

For NLLB, the main argument is not necessarily browser live latency. It is multilingual coverage. The NLLB-200 distilled 600M model card and the NLLB paper make the coverage and low-resource language story clear. That is valuable, but it is a different optimization target from rapidly changing chat/ticker/subtitle overlays.

For OPUS-MT, the OPUS-MT project and the OPUS/Marian ecosystem are more pair-specific. That makes it easier to choose and evaluate per language pair. The OPUS-MT paper also discusses compact and speed-oriented translation models for real-time use cases: Democratizing Machine Translation with OPUS-MT.

There are also some caution signs around quantized decoder paths in browser/WebGPU settings. For example, Transformers.js issue #1317 reports q8 decoder output problems on WebGPU, and Transformers.js issue #1518 reports a WebGPU translation crash path around NLLB/translation.

Those issues do not prove that NLLB is a bad choice. They only suggest that quantized NLLB/WebGPU should be benchmarked as a real pipeline before becoming the default hot path.

So my rough recommendation would be:

Keep OPUS-MT fp16 + greedy as the default live-region path. Treat NLLB as an opt-in coverage/quality path, especially for language pairs where OPUS-MT quality or availability is not good enough. Promote NLLB to the hot path only after a current Transformers.js v4 benchmark shows it is faster, stable, and good enough for the exact language pairs.
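
As a rough sketch of that split, the routing decision could be a small pure function. Everything here is hypothetical (the type names, the "src-tgt" pair format, and the idea of a per-pair NLLB preference list are my assumptions, not PaneTrans code):

```typescript
// Hypothetical routing sketch; none of these names come from PaneTrans.
type TranslationPath = "opus-mt" | "nllb" | "unsupported";

interface RoutingConfig {
  // Pairs that have a pair-specific OPUS-MT model, written as "src-tgt".
  opusPairs: ReadonlySet<string>;
  // True only if the user opted in to the larger multilingual model.
  nllbEnabled: boolean;
  // Pairs the user explicitly moved to NLLB because OPUS-MT quality was not good enough.
  nllbPreferredPairs: ReadonlySet<string>;
}

function pickTranslationPath(src: string, tgt: string, config: RoutingConfig): TranslationPath {
  const pair = `${src}-${tgt}`;
  // An explicit per-pair preference wins, but only when NLLB is enabled at all.
  if (config.nllbEnabled && config.nllbPreferredPairs.has(pair)) return "nllb";
  // Default hot path: the pair-specific model.
  if (config.opusPairs.has(pair)) return "opus-mt";
  // Coverage fallback for pairs OPUS-MT does not cover.
  return config.nllbEnabled ? "nllb" : "unsupported";
}
```

The point of keeping this as data plus a pure function is that promoting NLLB to the hot path later becomes a configuration change rather than a rewrite.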

Benchmarking Q1

I would not only measure tokens/sec. A live overlay benchmark probably needs:

| Metric | Why it matters |
| --- | --- |
| Cold load time | First-use friction |
| Download size | User adoption and cache reliability |
| Warm p50 latency | Normal overlay responsiveness |
| Warm p95 latency | Visible lag spikes |
| Memory / GPU failure rate | Browser/WebGPU stability |
| Stale-result rate | Whether old translations overwrite newer text |
| Language-pair quality | Whether NLLB’s coverage advantage matters for the target pair |
| Offline restart test | Whether cached models actually work when offline |

A practical benchmark set might look like:

Models:
- OPUS-MT fp16 WebGPU, where available
- OPUS-MT q8/q4/WASM, if useful and available
- NLLB q8/q4/q4f16/WebGPU, if available
- Optional: cloud/API baseline only for quality comparison, not default local path

Inputs:
- 20 short chat-like strings
- 20 ticker/UI-like strings
- 20 subtitle-like strings
- 5 language pairs, including at least one low-resource pair if that is a target

Report:
- cold load
- warm p50/p95
- memory
- failures
- visible update latency
- translation quality notes

The important number is probably not model throughput alone, but:

source text changes
→ OCR/DOM path notices it
→ translation runs
→ overlay updates

That is the user-visible latency.
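
For the warm p50/p95 numbers, a minimal sketch using the nearest-rank percentile definition might look like this. The function names are mine, and nearest-rank is just one reasonable convention; an interpolating percentile would give slightly different values on small sample sets:

```typescript
// Nearest-rank percentile over latency samples in milliseconds.
function percentile(samplesMs: readonly number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Summary for one model/backend combination after warm-up runs.
function summarizeWarmLatency(samplesMs: readonly number[]): { p50: number; p95: number } {
  return { p50: percentile(samplesMs, 50), p95: percentile(samplesMs, 95) };
}
```

With only a handful of samples the p95 is effectively the worst case, so the benchmark inputs listed above (60 strings per language pair) are about the minimum where p95 starts to mean something.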

3. Q2: offscreen document vs service worker

I would separate two questions:

  1. Can WebGPU run in a service worker now?
  2. Is the MV3 service worker the best lifecycle container for a warm, resident OCR/translation engine?

The first answer is increasingly “yes.” Hugging Face now has a very relevant guide: How to Use Transformers.js in a Chrome Extension. That guide is explicitly about using Transformers.js under Chrome Manifest V3 constraints, and it describes an architecture with:

  • a background service worker that hosts models,
  • a side panel for chat UI,
  • a content script for page-level actions.

That is now a strong reference architecture for Transformers.js Chrome extensions.

But the second question is more subtle. Chrome extension service workers are not persistent in the way Manifest V2 background pages were. The Chrome docs explain that extension service workers can be terminated after inactivity or long-running work: The extension service worker lifecycle.

So I would not read the HF guide as “offscreen is obsolete.” I would read it as:

The service-worker-hosted model pattern is now a serious reference architecture for Transformers.js Chrome extensions.

PaneTrans has a different hot path from a normal assistant extension. A chat assistant can often tolerate recoverable model state. A live overlay translator may care more about warm latency, OCR loops, frame scheduling, long-running visual capture, and avoiding stale overlay updates.

The offscreen document still seems defensible for this kind of workload, especially if it is used as a dedicated helper for DOM/canvas/media/OCR/model work. Chrome’s Offscreen API describes it as a hidden document for cases where a service worker needs DOM-like capabilities. There is also a related example where the offscreen document hosts a Transformers.js/WebGPU model while the service worker routes messages: Gemma Gem.

I would phrase the architecture choice as a comparison, not a correction:

| Option | Good for | Main caution |
| --- | --- | --- |
| Service worker model host | Aligns with the HF extension guide; good central coordinator | Must treat model state as recoverable; lifecycle can interrupt warm state |
| Offscreen model/OCR helper | Better fit for resident media/OCR/canvas-style work | Should not become an unstructured “second background page”; keep message boundaries clear |
| Hybrid | Service worker handles permissions/routing/cache state; offscreen handles capture/OCR/model loop | More moving parts, but likely clearer for PaneTrans-like workloads |

If the current implementation already has a clean offscreen model/OCR worker, I would treat this as hardening rather than a rewrite.

If the current implementation puts most OCR/capture/model work directly in the content script, or relies on frequent screenshots, then moving toward a clearer service-worker/offscreen split may be a larger change.

A possible responsibility split:

| Component | Possible responsibility |
| --- | --- |
| Content script | Region selection, DOM text extraction, overlay positioning, MutationObserver |
| Service worker | Permissions, tab routing, tabCapture stream ID, cache/model status, mode switching |
| Offscreen document | Media/canvas processing, OCR worker pool, optional warm model host |
| Side panel / popup | Language settings, local/cloud mode, cache status, download progress, privacy state |

This is not the only architecture, but it is a useful mental model.
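
One way to keep those message boundaries explicit is a typed message union with a single routing function. This is only an illustration: the message kinds and field names below are invented for the example and do not reflect the actual PaneTrans protocol.

```typescript
// Hypothetical message protocol; kinds and fields are illustrative only.
type Component = "content-script" | "service-worker" | "offscreen" | "side-panel";

type PaneMessage =
  | { kind: "region-selected"; tabId: number; rect: { x: number; y: number; w: number; h: number } }
  | { kind: "start-capture"; tabId: number; streamId: string }
  | { kind: "translation-ready"; tabId: number; seq: number; text: string }
  | { kind: "model-status"; model: string; state: string };

// Which component is expected to handle each message kind.
function handlerFor(message: PaneMessage): Component {
  switch (message.kind) {
    case "region-selected":
      return "service-worker"; // routing and permissions live here
    case "start-capture":
      return "offscreen"; // media/canvas/OCR work lives here
    case "translation-ready":
      return "content-script"; // overlay positioning lives here
    case "model-status":
      return "side-panel"; // cache/download UI lives here
  }
}
```

Because the switch is exhaustive over the union, adding a new message kind without deciding who owns it becomes a compile error, which is a cheap guard against the offscreen document turning into an unstructured second background page.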

4. Q3: OCR direction

For OCR, I would first optimize the scheduling layer before replacing the OCR engine.

Tesseract.js is not necessarily glamorous, but it is a practical browser OCR baseline. The important part is not to call OCR too often and not to create/destroy workers unnecessarily. The Tesseract.js performance docs warn against arbitrary worker creation and recommend reusing workers or using a bounded scheduler/pool for parallel work. The scheduler API is documented here: Tesseract.js API docs. A conceptual comparison of workers and schedulers is also in workers_vs_schedulers.md.

For PaneTrans, I would probably think about the OCR loop like this:

selected ROI
→ crop
→ downscale if useful
→ cheap frame diff / visual-change check
→ debounce
→ bounded OCR queue
→ Tesseract worker or scheduler
→ OCR text changed?
→ translation
→ discard stale result if a newer ROI/text version exists

The key point is: do not OCR every frame just because the frame exists.
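
The cheap frame-diff step can be very simple. Below is a minimal sketch that compares a sparse sample of bytes between two same-sized RGBA buffers (the kind of array that ImageData.data holds). The threshold and stride defaults are guesses that would need tuning on real content, and the function name is mine:

```typescript
// Returns true when the region looks different enough to justify an OCR pass.
// prev/next are raw RGBA bytes for the cropped ROI.
function regionChanged(
  prev: Uint8ClampedArray | null,
  next: Uint8ClampedArray,
  threshold = 4, // mean absolute difference (0-255) that counts as "changed"; a guess
  stride = 16, // bytes between samples; a multiple of 4 stays on the red channel
): boolean {
  // First frame, or the ROI was resized: always treat as changed.
  if (prev === null || prev.length !== next.length) return true;
  let total = 0;
  let count = 0;
  for (let i = 0; i < next.length; i += stride) {
    total += Math.abs(next[i] - prev[i]);
    count++;
  }
  return count > 0 && total / count > threshold;
}
```

A mean-difference check like this will miss small text changes in a large region, so for chat or ticker regions a per-tile comparison may be a better fit; the sketch only shows where the gate sits in the loop.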

A practical OCR hardening list:

| OCR hardening item | Why it helps |
| --- | --- |
| ROI-only OCR | Less work, clearer privacy story |
| Frame diff before OCR | Avoids repeated OCR on unchanged regions |
| Debounce | Avoids OCR during visual transitions |
| Bounded queue | Prevents backlog when frames change faster than OCR completes |
| Sequence IDs | Prevents old OCR/translation jobs from overwriting newer text |
| Worker reuse | Avoids repeated Tesseract startup cost |
| Worker recycle | Useful for long-running sessions if memory grows |
| Language-specific OCR config | Avoids unnecessary OCR language packs |
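
For the bounded queue, one plausible policy for a live region is "drop the oldest pending job when full", since a newer crop of the same ROI makes the older one less useful. A minimal sketch, with a hypothetical class name and no dependency on Tesseract.js itself:

```typescript
// Fixed-capacity FIFO that discards the oldest pending item when full.
class BoundedQueue<T> {
  private readonly capacity: number;
  private items: T[] = [];
  // Number of jobs discarded because the queue was full.
  dropped = 0;

  constructor(capacity: number) {
    if (capacity < 1) throw new Error("capacity must be >= 1");
    this.capacity = capacity;
  }

  push(item: T): void {
    if (this.items.length === this.capacity) {
      this.items.shift(); // discard the oldest pending job
      this.dropped++;
    }
    this.items.push(item);
  }

  // Next job to hand to an OCR worker, or undefined when idle.
  next(): T | undefined {
    return this.items.shift();
  }

  get size(): number {
    return this.items.length;
  }
}
```

A capacity of 1 or 2 per region is probably enough for a live overlay; the dropped counter is worth surfacing in benchmarks because a high value means frames are changing faster than OCR completes.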

Transformers.js-native OCR is still interesting. For example, TrOCR-like or PaddleOCR-like routes could become useful later. But I would not expect them to be a drop-in replacement for live video/canvas OCR unless detection, layout, cropping, size, initialization time, and browser memory are also handled.

So I would answer Q3 as:

Yes, I would be interested in Transformers.js-native OCR options, but I would first make the Tesseract/capture pipeline boringly reliable. The immediate wins are probably ROI scheduling, stale-job cancellation, worker reuse, and bounded queues, not swapping the OCR model.
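
Stale-job cancellation does not need real cancellation to start with; dropping results whose sequence ID is no longer current already prevents old text from overwriting new text. A rough sketch under that assumption (class and function names are hypothetical):

```typescript
// Tracks the current version of one selected region.
class RegionVersion {
  private current = 0;

  // Call when the ROI moves or its OCR text changes; returns the new job's ID.
  bump(): number {
    return ++this.current;
  }

  isCurrent(seq: number): boolean {
    return seq === this.current;
  }
}

// Renders the translation only if no newer version of the region exists.
function applyIfCurrent(
  version: RegionVersion,
  seq: number,
  text: string,
  render: (text: string) => void,
): boolean {
  if (!version.isCurrent(seq)) return false; // stale: a newer job was issued
  render(text);
  return true;
}
```

The same ID can be carried through OCR and translation, so a result that was already outdated after OCR never reaches the translation model at all.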

5. Capture/input pipeline

I would probably separate input paths rather than treat all text sources the same:

| Path | Use case | Notes |
| --- | --- | --- |
| DOM text path | Normal web pages, chat, tickers, dynamic DOM text | Best first path: fast, low permission burden, no OCR |
| Static screenshot path | Occasional image/canvas text | Good fallback, but not ideal as a high-frequency loop |
| tabCapture/offscreen path | Video, games, subtitles, frequently changing canvas | More complex, but likely better for continuous visual text |

For screenshots, captureVisibleTab() can be useful, but I would avoid making it the high-frequency OCR loop. Chrome documents a maximum captureVisibleTab call rate, exposed as MAX_CAPTURE_VISIBLE_TAB_CALLS_PER_SECOND, in the chrome.tabs API.
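
If the screenshot path stays as a fallback, a small throttle in front of it keeps the extension under that limit. In this sketch the limit is passed in rather than hard-coded (in an extension it would presumably be read from the constant named above), and the clock is injected so the logic can be tested without a browser. The class name is mine:

```typescript
// Allows at most maxCallsPerSecond evenly spaced captures.
class CaptureThrottle {
  private readonly minIntervalMs: number;
  private lastCaptureMs = Number.NEGATIVE_INFINITY;

  constructor(maxCallsPerSecond: number) {
    if (maxCallsPerSecond <= 0) throw new Error("maxCallsPerSecond must be > 0");
    this.minIntervalMs = 1000 / maxCallsPerSecond;
  }

  // Returns true if a capture may happen now, and records it if so.
  tryAcquire(nowMs: number): boolean {
    if (nowMs - this.lastCaptureMs < this.minIntervalMs) return false;
    this.lastCaptureMs = nowMs;
    return true;
  }
}
```

In practice I would run the fallback well below the documented ceiling, because OCR is likely slower than the capture anyway.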

For continuous media-like cases, tabCapture is probably worth considering. The chrome.tabCapture API covers capturing the current tab’s media stream, with user invocation constraints.

A possible layered approach:

DOM text available?
  → use DOM text path

visual static region?
  → use low-frequency screenshot OCR fallback

continuous video/canvas/game/subtitle region?
  → consider tabCapture + offscreen document

This also helps with Chrome Web Store review and user trust: process the selected region, not the whole page, and make the data flow easy to explain.
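
The layered approach above can be sketched as a tiny decision function. The two input flags are assumptions about what the extension can detect for a selected region; how they are actually detected is the hard part and is not shown:

```typescript
// Hypothetical names; the traits would come from whatever detection PaneTrans has.
type CapturePath = "dom-text" | "screenshot-ocr" | "tabcapture-offscreen";

interface RegionTraits {
  // Text nodes were found inside the selected region.
  hasDomText: boolean;
  // The region keeps changing visually (video, game, subtitles, animated canvas).
  changesContinuously: boolean;
}

function chooseCapturePath(region: RegionTraits): CapturePath {
  // DOM text is the fastest and least permission-heavy source, so it always wins.
  if (region.hasDomText) return "dom-text";
  // Otherwise pick the visual path that matches how often the region changes.
  return region.changesContinuously ? "tabcapture-offscreen" : "screenshot-ocr";
}
```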

6. Canvas/video edge cases

One easy trap is assuming that because something is visible, it is always directly readable as pixels from the page.

For example, if a page canvas contains cross-origin images or video without the right CORS conditions, the canvas can become tainted. Then APIs such as getImageData(), toBlob(), or toDataURL() may fail. MDN has a clear explanation here: Use cross-origin images in a canvas.

So I would design the visual-input path with fallback tiers:

1. DOM text path
2. direct element/canvas/image path, if safely readable
3. selected-region screenshot fallback
4. tabCapture/offscreen path for continuous visual regions

For video/canvas performance, another future optimization is avoiding unnecessary full-frame CPU readback. There is a related Transformers.js discussion/request around VideoFrame / WebCodecs / avoiding getImageData()-style paths here: Transformers.js issue #1385: Add VideoFrame support to RawImage.

I would not make that a first rewrite requirement, but I would keep it in mind if the video OCR path becomes a bottleneck.

7. Cache/offline/packaging reliability

For a local-first extension, I would separate three things:

  1. Model weights
  2. Transformers.js runtime code
  3. ONNX Runtime WASM/factory files

The model weights can be downloaded and cached. But extension runtime code and WASM helper files need special care under Manifest V3.

There is a relevant Transformers.js issue where a Chrome extension hit remote-hosted-code problems while trying to load ORT helper files from a CDN: Transformers.js issue #1248: Chrome extension fails due to remotely hosted code. Chrome’s own documentation on remote hosted code explains why this matters for MV3 review: Deal with remote hosted code violations.

So I would make cache/offline state explicit in the UI:

| State | UI meaning |
| --- | --- |
| Not downloaded | First run required |
| Downloading | Show progress |
| Cached | Model files available |
| Offline-ready | Model + runtime/WASM path usable without network |
| Missing file | Cache incomplete |
| Redownload needed | Version/cache mismatch |
| Clear cache | User recovery path |

The Transformers.js environment settings are also worth reviewing: Transformers.js env API. Depending on the final packaging, settings such as local model paths, remote model allowance, browser/WASM cache, custom fetch, and cache keys may matter. The general custom model usage docs are also relevant: Transformers.js custom usage. For dtype selection, the Transformers.js dtype guide is worth checking because available dtypes are model/repo dependent rather than universal.

A conservative cache/offline design could be:

First run:
  - show model choice
  - show approximate download size
  - download with progress
  - verify required files are present
  - run a tiny smoke test

Offline mode:
  - do not silently attempt remote fetch
  - show whether each model is offline-ready
  - if a file is missing, explain that first download is required

Recovery:
  - clear/redownload model cache
  - version cache keys by model/revision
  - handle partial cache or failed download visibly
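
To make those states concrete, here is a minimal sketch that derives a UI state from a snapshot of what is cached. The snapshot shape, state names, and cache-key format are all my assumptions; this is not the HF demo's ModelRegistry and not a Transformers.js API:

```typescript
// Hypothetical state model for one translation or OCR model.
type ModelCacheState =
  | "not-downloaded"
  | "downloading"
  | "missing-file"
  | "redownload-needed"
  | "cached"
  | "offline-ready";

interface ModelCacheSnapshot {
  requiredFiles: readonly string[];
  cachedFiles: ReadonlySet<string>;
  cachedRevision: string | null;
  wantedRevision: string;
  downloadInProgress: boolean;
  // True when the JS/WASM runtime files ship inside the extension package.
  runtimeBundled: boolean;
}

function deriveCacheState(s: ModelCacheSnapshot): ModelCacheState {
  if (s.downloadInProgress) return "downloading";
  const present = s.requiredFiles.filter((f) => s.cachedFiles.has(f)).length;
  if (present === 0) return "not-downloaded";
  if (present < s.requiredFiles.length) return "missing-file";
  if (s.cachedRevision !== s.wantedRevision) return "redownload-needed";
  // Weights alone are not enough offline; the runtime must also be local.
  return s.runtimeBundled ? "offline-ready" : "cached";
}

// Versioned cache key so a model update never reuses stale files.
function cacheKey(modelId: string, revision: string): string {
  return `${modelId}@${revision}`;
}
```

Keeping "cached" and "offline-ready" as separate states matters because a model can be fully downloaded while the runtime path still needs the network.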

In the HF Chrome-extension demo repo, the ModelRegistry style of checking needed files, metadata, size, and cache state is useful as a pattern. Even if PaneTrans does not copy that exact implementation, the product behavior is good: the user should know what is downloaded, what is cached, and what works offline.

8. Chrome Web Store / privacy story

I would treat the store/privacy story as part of the architecture, not as final paperwork.

PaneTrans may touch page text, screenshots, OCR output, and translated text. Chrome’s Limited Use policy says web browsing activity collection/use should be limited to the user-facing feature and not repurposed. The User data FAQ also explains privacy policy and sensitive data expectations.

A privacy-friendly implementation also helps performance:

| Privacy-friendly choice | Performance benefit |
| --- | --- |
| Process selected ROI only | Less OCR work |
| Local-first by default | Lower latency and clearer trust model |
| Do not store page screenshots | Less storage/cache complexity |
| Drop intermediate frames | Lower memory pressure |
| Separate local/cloud mode | Clearer UX and easier policy explanation |
| Optional permissions where possible | Lower permission friction |

I would probably describe the intended data flow explicitly:

- OCR runs only on the selected region.
- Local mode does not send page images/text to a server.
- Cloud mode, if enabled, is separate and explicit.
- Intermediate screenshots/frames are not retained.
- Model cache stores model/runtime files, not page content.

This is not just a policy matter. It is also a useful architecture constraint.

9. Suggested hardening checklist

If I were hardening PaneTrans, I would probably prioritize this order:

| Priority | Item | Why |
| --- | --- | --- |
| 1 | Keep DOM text extraction as the first path | Fastest, least fragile, least permission-heavy |
| 2 | OCR only the selected ROI | Lower cost, clearer privacy story |
| 3 | Add sequence IDs to OCR and translation jobs | Prevent stale result overwrites |
| 4 | Drop stale OCR/translation results | Critical for live regions |
| 5 | Add debounce and frame-diff before OCR | Avoid unnecessary OCR |
| 6 | Reuse Tesseract workers | Avoid repeated initialization |
| 7 | Use a bounded scheduler if parallelism is needed | Avoid unbounded memory growth |
| 8 | Treat captureVisibleTab() as low-frequency fallback | Avoid high-frequency screenshot loop |
| 9 | Consider tabCapture + offscreen for continuous regions | Better fit for video/canvas/subtitle cases |
| 10 | Show model download/cache/offline status | Better first-run and failure recovery |
| 11 | Bundle/configure JS/WASM runtime files in an MV3-friendly way | Avoid remote-hosted-code issues |
| 12 | Keep local/cloud mode and privacy disclosure explicit | Easier user trust and store review |
| 13 | Benchmark OPUS-MT and NLLB as end-to-end overlay paths | Measures the real UX, not just model speed |

10. Suggested benchmark matrix

For the model/runtime question, I would not only measure tokens/sec.

A more useful validation matrix might be:

| Benchmark item | Why |
| --- | --- |
| Cold model load time | First-use experience |
| Warm p50 latency | Normal live overlay feel |
| Warm p95 latency | Lag spikes |
| Download size | User friction |
| Memory usage | WebGPU/browser stability |
| OCR calls/min | Whether scheduling is working |
| Stale-result drop rate | Whether old results are safely discarded |
| Translation quality by language pair | Whether NLLB’s coverage advantage matters |
| Offline restart test | Whether cache/offline actually works |
| Long-session test | Whether OCR workers/memory stay healthy |
| Store-review readiness | Whether permissions/data flow are explainable |

I would especially measure “visible update latency”:

source text changes
→ DOM/capture/OCR detects it
→ translation runs
→ stale result check passes
→ overlay updates

That seems more relevant than model throughput alone.
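
Visible update latency and the stale-result drop rate can be logged together, keyed by the same sequence ID. A rough sketch with hypothetical names, taking timestamps as arguments so it does not depend on any particular clock API:

```typescript
// Records one latency sample per sequence ID that actually reached the overlay.
class OverlayLatencyLog {
  private startedAt = new Map<number, number>();
  readonly latenciesMs: number[] = [];
  staleDrops = 0;

  // The source text or visual region changed; job `seq` starts here.
  sourceChanged(seq: number, nowMs: number): void {
    this.startedAt.set(seq, nowMs);
  }

  // The overlay showed the result for `seq`.
  overlayUpdated(seq: number, nowMs: number): void {
    const start = this.startedAt.get(seq);
    if (start === undefined) return; // unknown or already finished job
    this.latenciesMs.push(nowMs - start);
    this.startedAt.delete(seq);
  }

  // The result for `seq` was discarded as stale before reaching the overlay.
  resultDropped(seq: number): void {
    if (this.startedAt.delete(seq)) this.staleDrops++;
  }

  // Fraction of finished jobs that were dropped as stale.
  staleDropRate(): number {
    const finished = this.latenciesMs.length + this.staleDrops;
    return finished === 0 ? 0 : this.staleDrops / finished;
  }
}
```

A high drop rate is not a failure by itself; it only becomes a problem when it coincides with long visible latency, which would suggest the pipeline is spending most of its time on work that never reaches the screen.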

11. If the current implementation already does X…

Because I do not know the internal implementation, I would phrase implementation advice conditionally.

| If current PaneTrans already does this… | Then I would treat the next work as… |
| --- | --- |
| Content script only handles selection/overlay/DOM extraction | Mostly hardening |
| Offscreen document already hosts OCR/model work | Mostly hardening |
| Service worker already handles routing/permissions/cache state | Mostly hardening |
| OCR already runs only on selected ROI | Mostly hardening |
| Model/cache status is already visible | Mostly UX refinement |

| If current PaneTrans instead does this… | Then the change may be larger |
| --- | --- |
| Content script owns OCR, model inference, capture, and overlay all together | May need responsibility split |
| captureVisibleTab() is used as a high-frequency loop | Consider screenshot fallback vs tabCapture path |
| Tesseract worker is created per frame/crop | Add worker reuse/scheduler |
| No stale-job cancellation | Add sequence IDs and stale-result drop |
| JS/WASM helper files are fetched dynamically from CDN | Review MV3 packaging |
| Cache state is implicit | Add explicit cache/offline status and recovery UI |

This keeps the suggestion non-critical: if the architecture is already close, it is not a rewrite; if not, the table explains where larger changes might be.

12. Final rough recommendation

Putting it all together:

I would keep the current direction, but frame the next step as pipeline hardening. OPUS-MT fp16 + greedy seems like the safer default live path for now; NLLB is a good opt-in coverage/quality path. The offscreen-document approach still looks defensible for a low-latency OCR/translation loop, even though the HF service-worker model-host pattern is now a strong reference architecture. For OCR, I would first optimize scheduling, worker reuse, bounded queues, and stale-result cancellation before replacing Tesseract.js. The most important reliability work may be around capture strategy, model/WASM packaging, cache/offline state, and a clear local-first privacy story.
