
PTQ INT8 via TFLiteConverter — encoder-decoder seq2seq model loses encoder context entirely after conversion

Hugging Face Forums [Unofficial] May 16, 2026

I can’t find a single real-world example of this working “as-is” through a search…


Are there any real solutions for full-INT8 TFLite seq2seq Transformer deployment?

Short answer: yes, but not as a simple TFLiteConverter flag.

For a Hugging Face-style encoder-decoder Transformer such as T5, MarianMT, BART, mBART, Pegasus, M2M100, or NLLB, the realistic solution is not:

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

and done.

That path can produce a valid .tflite file while the decoder becomes numerically useless. The likely reason is that the decoder’s cross-attention path is not being calibrated correctly. The encoder can quantize cleanly, while the decoder loses source conditioning and starts producing repeated tokens, random tokens, empty strings, or nonsensical translations.

The most realistic path is:

encoder_int8.tflite
+
decoder_step_int8.tflite
+
host-side generation loop
+
explicit decoder calibration
+
explicit encoder→decoder quantized boundary handling
+
Quantization Debugger
+
possibly decoder-step QAT or a custom attention delegate

There is probably no public turnkey recipe for this exact target today.


Current state of the field

I would summarize the situation like this:

Full-INT8 TFLite deployment of the decoder of a Hugging Face-style encoder-decoder Transformer is not a mature public path. There are good public resources for TFLite INT8 in general, for ONNX/CTranslate2 seq2seq deployment, and for Transformer quantization research. But I could not find a validated public example of a T5/MarianMT/BART-style encoder.tflite + decoder_step.tflite full-INT8 deployment with working decoder cross-attention and custom delegate execution.

Useful references:

  • LiteRT / TFLite post-training integer quantization
  • LiteRT post-training quantization overview
  • LiteRT Quantization Debugger
  • LiteRT 8-bit quantization specification
  • Optimum ONNX encoder-decoder export guide
  • Optimum TFLite export guide
  • ONNX Runtime quantization docs
  • TensorFlow Model Optimization QAT guide
  • TensorFlow Model Optimization comprehensive QAT guide
  • CTranslate2
  • CTranslate2 quantization docs

Why the fused PTQ path fails

A fused seq2seq graph hides too much.

A seq2seq Transformer naturally runs like this:

1. Run encoder once.
2. Repeatedly run decoder for each generated token.
3. Select the next token outside the model.
4. Stop on EOS or max length.

The decoder uses the source through cross-attention:

decoder hidden state → Q
encoder hidden state → K, V

attention_scores = Q @ K.T
attention_probs = softmax(attention_scores + mask)
context = attention_probs @ V
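
To make the data flow concrete, here is a minimal single-head, single-query sketch of the formula above in plain Python. The function names and the optional `scale` argument are my own illustration, not taken from any particular model; many implementations multiply the scores by 1/sqrt(d), which you can pass as `scale`.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(q, keys, values, mask=None, scale=1.0):
    """Single decoder query attending over encoder positions.

    q:      [d]            decoder-side query vector
    keys:   [src_len][d]   encoder-side K
    values: [src_len][d_v] encoder-side V
    mask:   [src_len]      additive mask (0 keeps, large negative hides)
    """
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    if mask is not None:
        scores = [s + m for s, m in zip(scores, mask)]
    probs = softmax(scores)
    d_v = len(values[0])
    context = [sum(p * v[j] for p, v in zip(probs, values)) for j in range(d_v)]
    return probs, context
```

If K collapses to identical rows (for example after a bad quantization range), all scores are equal, the probabilities become uniform, and the context stops depending on which source position matters.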

If INT8 quantization corrupts this path, the decoder can still emit tokens because it still has:

  • decoder token embeddings,
  • decoder self-attention,
  • learned language-model priors,
  • LM-head bias,
  • forced BOS/language-token priors.

But it no longer receives useful source information. That produces:

same-ish output for unrelated sources
repeated tokens
empty strings
random tokens
nonsensical translations
BLEU collapse

That is not ordinary quantization loss. That is source-conditioning failure.


Why encoder-only calibration is insufficient

TFLite full integer quantization depends on representative data to calibrate activation ranges.

For a decoder, representative data must cover the decoder state distribution, not only source inputs.

Bad calibration:

def representative_dataset():
    for source in sources:
        yield {
            "input_ids": source["input_ids"],
            "attention_mask": source["attention_mask"],
        }

That mostly calibrates the encoder path.

A decoder needs calibration samples like:

decoder_input_ids
decoder_attention_mask
encoder_hidden_states
encoder_attention_mask

and those samples must represent real generation states:

BOS-only prefix
early target prefix
middle target prefix
near-EOS prefix
short source
long source
padding-heavy source
near-no-padding source
names / numbers / rare tokens
domain-specific examples

A better calibration strategy is:

200 source examples × 5 decoder prefixes

not:

1000 source examples × encoder only

The issue is not just dataset size. It is whether the decoder cross-attention tensors are ever exercised with realistic activation ranges.
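
One way to get the "sources × prefixes" coverage described above is to spread a handful of prefix lengths across each target, from BOS-only to near-EOS. The sketch below is only an illustration: `pick_prefix_lengths` and `calibration_plan` are hypothetical helper names, and the even spacing is an assumption rather than a recommendation from the LiteRT docs.

```python
def pick_prefix_lengths(target_len, num_prefixes=5):
    """Return sorted unique prefix lengths in [1, target_len]."""
    if target_len < 1:
        return []
    if num_prefixes <= 1 or target_len == 1:
        return [1]
    step = (target_len - 1) / (num_prefixes - 1)
    # A set removes duplicates that appear for very short targets.
    lengths = {1 + round(i * step) for i in range(num_prefixes)}
    return sorted(lengths)

def calibration_plan(target_lengths, num_prefixes=5):
    """List (source_index, prefix_length) pairs to feed the decoder."""
    plan = []
    for idx, target_len in enumerate(target_lengths):
        for prefix_len in pick_prefix_lengths(target_len, num_prefixes):
            plan.append((idx, prefix_len))
    return plan
```

With 200 targets that are long enough, this yields 200 × 5 decoder calibration samples; short targets simply contribute fewer prefixes.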


Solution 1: Split the graph

The first serious solution is to stop trying to deploy the fused graph.

Do not make this the production target:

fused_seq2seq_int8.tflite

Use this instead:

encoder_int8.tflite
decoder_step_int8.tflite
host_generation_loop

This matches the architecture used by mature seq2seq export flows. Hugging Face Optimum’s ONNX path explicitly handles encoder-decoder generation by separating encoder and decoder behavior, including decoder past-key-value reuse for autoregressive generation:

  • Optimum ONNX export guide
  • Optimum ONNX export functions

Target layout:

encoder_int8.tflite

inputs:
  input_ids: int32
  attention_mask: int32

outputs:
  encoder_hidden_states: int8



decoder_step_int8.tflite

inputs:
  decoder_input_ids: int32
  decoder_attention_mask: int32
  encoder_hidden_states: int8
  encoder_attention_mask: int32

outputs:
  logits: int8

Host-side generation:

encoder_states = run_encoder(input_ids, attention_mask)

decoder_ids = [decoder_start_token_id]

for step in range(max_new_tokens):
    logits = run_decoder_step(
        decoder_input_ids=decoder_ids,
        decoder_attention_mask=make_decoder_mask(decoder_ids),
        encoder_hidden_states=encoder_states,
        encoder_attention_mask=attention_mask,
    )

    next_id = select_next_token(logits)
    decoder_ids.append(next_id)

    if next_id == eos_token_id:
        break

This does not automatically fix quantization, but it makes the problem debuggable.


Solution 2: Build decoder-specific representative data

The decoder representative dataset must feed the decoder signature directly.

Conceptual decoder calibration:

def representative_decoder_dataset():
    for src_text, tgt_text in calibration_pairs:
        encoder_inputs = tokenize_source(src_text)

        # For debugging:
        #   Use FP32 encoder states.
        #
        # For deployment fidelity:
        #   Use quantized encoder states plus the real encoder→decoder requantization bridge.
        encoder_hidden_states = run_encoder_for_calibration(encoder_inputs)

        target_ids = tokenize_target(tgt_text)

        for prefix_len in [1, 2, 4, 8, 16, 32]:
            if prefix_len > len(target_ids):
                continue

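            prefix = target_ids[:prefix_len]
            # Build the mask from the unpadded prefix, then pad both to the
            # static decoder length. Deriving the mask from padded ids is
            # unsafe when the decoder start token shares the pad id.
            mask = pad_to_static_length(make_decoder_mask(prefix), DECODER_LEN)
            prefix = pad_to_static_length(prefix, DECODER_LEN)

            yield {
                "decoder_input_ids": prefix.astype("int32"),
                "decoder_attention_mask": mask.astype("int32"),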
                "encoder_hidden_states": encoder_hidden_states,
                "encoder_attention_mask": encoder_inputs["attention_mask"].astype("int32"),
            }
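
The padding and mask helpers are left undefined in the snippet above. A minimal list-based sketch of what they might do together is below. `pad_and_mask` is a hypothetical name, and `pad_id=0` is an assumption (it matches T5, where the decoder start token and the pad token share id 0). The mask is built from the true prefix length rather than by comparing ids against the pad id, since that comparison would hide the start token in such models.

```python
def pad_and_mask(prefix_ids, static_len, pad_id=0):
    """Pad a decoder prefix to a static length and build its attention mask.

    Returns (ids, mask) as plain lists; convert them to int32 arrays with a
    batch dimension before yielding them to the converter.
    """
    prefix_ids = list(prefix_ids)
    if len(prefix_ids) > static_len:
        raise ValueError("prefix is longer than the static decoder length")
    n_pad = static_len - len(prefix_ids)
    ids = prefix_ids + [pad_id] * n_pad
    # 1 marks real tokens (including a start token equal to pad_id), 0 marks padding.
    mask = [1] * len(prefix_ids) + [0] * n_pad
    return ids, mask
```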

If your SavedModel has multiple signatures, the representative dataset can conceptually be split by signature:

def representative_dataset():
    for batch in encoder_calibration_batches:
        yield (
            "encode",
            {
                "input_ids": batch["input_ids"],
                "attention_mask": batch["attention_mask"],
            },
        )

    for batch in decoder_calibration_batches:
        yield (
            "decode",
            {
                "decoder_input_ids": batch["decoder_input_ids"],
                "decoder_attention_mask": batch["decoder_attention_mask"],
                "encoder_hidden_states": batch["encoder_hidden_states"],
                "encoder_attention_mask": batch["encoder_attention_mask"],
            },
        )

Relevant docs:

  • LiteRT post-training quantization
  • LiteRT post-training integer quantization

The key idea:

The decoder must be calibrated as a decoder, not as a side effect of encoder input calibration.


Solution 3: Handle the encoder→decoder quantized boundary

If the encoder and decoder are separate TFLite models, the boundary can break the model even if both models are individually valid.

The encoder output and decoder input may have different quantization parameters:

encoder output:
  scale_e
  zero_point_e

decoder encoder_hidden_states input:
  scale_d
  zero_point_d

You cannot blindly pass raw int8 bytes from the encoder output into the decoder input unless the quantization parameters match.

If they differ, requantize:

real_value = scale_e * (q_e - zero_point_e)
q_d = round(real_value / scale_d + zero_point_d)
q_d = clamp(q_d, -128, 127)
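
As a plain-Python sketch of that bridge (the function name is mine, and in a real host loop the four parameters would be read from the two models' tensor metadata rather than hard-coded):

```python
def requantize(q_e, scale_e, zero_point_e, scale_d, zero_point_d):
    """Map int8 values from the encoder's output quantization to the decoder's input quantization."""
    out = []
    for q in q_e:
        real_value = scale_e * (q - zero_point_e)
        # Python's round() rounds halves to even; a real kernel may round differently.
        q_d = round(real_value / scale_d) + zero_point_d
        out.append(max(-128, min(127, q_d)))
    return out
```

Note the clamp: if the decoder input was calibrated with a much finer scale than the encoder output, large encoder values saturate at -128 or 127, which is one way the decoder can end up with meaningless encoder memory.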

Deployment-style boundary test matrix:

Encoder | Boundary                   | Decoder | Meaning
FP32    | float                      | FP32    | Split-graph reference
INT8    | dequantized float          | FP32    | Tests encoder quality
FP32    | quantized to decoder input | INT8    | Tests decoder quality
INT8    | requantized                | INT8    | Full deployment-like path

If this boundary is wrong, the symptom can look exactly like broken cross-attention:

decoder runs
but receives meaningless encoder memory

Solution 4: Move cross-attention K/V projection to the encoder side

This is an architecture-level workaround.

Normally, each decoder layer computes K/V from encoder hidden states:

K_i = W_k_i(encoder_hidden_states)
V_i = W_v_i(encoder_hidden_states)

Instead, make the encoder-side artifact produce precomputed cross-attention memory:

encoder_int8.tflite

outputs:
  cross_k_layer_0
  cross_v_layer_0
  cross_k_layer_1
  cross_v_layer_1
  ...

Then make the decoder consume those tensors directly:

decoder_step_int8.tflite

inputs:
  decoder_input_ids
  decoder_attention_mask
  cross_k_layer_0
  cross_v_layer_0
  cross_k_layer_1
  cross_v_layer_1
  ...

Why this can help:

  • K/V are computed once, not every decoder step.
  • K/V become explicit graph outputs/inputs.
  • You can inspect their quantization parameters directly.
  • You can design the encoder→decoder boundary around K/V instead of generic hidden states.
  • The decoder graph becomes more predictable.

Tradeoff:

num_decoder_layers × 2 tensors

You get more interface complexity, but also much more control.

This is one of the most promising workarounds if the failure is specifically cross-attention K/V scale mismatch.
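
To get a feel for the interface cost, here is a small sketch that lists the decoder-step inputs for a given layer count and estimates the size of the precomputed memory. The tensor names follow the layout above; the helper names and the one-byte-per-element (int8) assumption are mine.

```python
def cross_kv_names(num_decoder_layers):
    """Names of the precomputed K/V tensors, two per decoder layer."""
    names = []
    for i in range(num_decoder_layers):
        names.append(f"cross_k_layer_{i}")
        names.append(f"cross_v_layer_{i}")
    return names

def decoder_step_inputs(num_decoder_layers):
    """Full input list of the decoder-step model in this layout."""
    return ["decoder_input_ids", "decoder_attention_mask"] + cross_kv_names(num_decoder_layers)

def cross_kv_bytes(num_decoder_layers, src_len, d_model, bytes_per_elem=1):
    """Memory for all K/V tensors at batch size 1 (int8 by default)."""
    return num_decoder_layers * 2 * src_len * d_model * bytes_per_elem
```

For example, 6 decoder layers with a source length of 128 and a model width of 512 give 12 extra tensors and 786432 bytes of int8 memory that the host has to carry between the two models.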


Solution 5: Use first-step source-sensitivity testing

Before relying on BLEU, test whether the decoder still sees the source.

Use two unrelated inputs:

source A: The committee approved the budget after a long debate.
source B: The patient developed a fever after the second injection.
decoder prefix: decoder_start_token_id

Compare first-step logits:

FP32(source A, BOS) vs FP32(source B, BOS)
INT8(source A, BOS) vs INT8(source B, BOS)

Healthy behavior:

FP32 logits differ across sources.
INT8 logits also differ across sources.

Broken behavior:

FP32 logits differ across sources.
INT8 logits are nearly identical across sources.

Minimal helper:

import numpy as np

def topk_ids(logits, k=10):
    flat = np.asarray(logits).reshape(-1)
    return np.argsort(flat)[-k:][::-1]

def compare_logits(logits_a, logits_b, k=10):
    a = np.asarray(logits_a).reshape(-1).astype(np.float64)
    b = np.asarray(logits_b).reshape(-1).astype(np.float64)

    top_a = topk_ids(a, k)
    top_b = topk_ids(b, k)

    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    return {
        "argmax_a": int(top_a[0]),
        "argmax_b": int(top_b[0]),
        "same_argmax": bool(top_a[0] == top_b[0]),
        "topk_overlap": len(set(top_a.tolist()) & set(top_b.tolist())),
        "cosine": float(cosine),
        "range_a": float(a.max() - a.min()),
        "range_b": float(b.max() - b.min()),
        "top_a": top_a.tolist(),
        "top_b": top_b.tolist(),
    }

This test is more diagnostic than BLEU.

BLEU tells you output quality is bad. First-step source sensitivity tells you whether the decoder lost encoder context immediately.
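
If you want to turn the healthy/broken distinction into an automated check, a rough sketch is below. It takes two result dicts with the `cosine` and `same_argmax` keys produced by the helper above (one for the FP32 A-vs-B comparison, one for the INT8 A-vs-B comparison). The function name and the 0.999 cut-off are assumptions to tune on your own model, not established thresholds.

```python
def diagnose_source_sensitivity(fp32_cmp, int8_cmp, blind_cosine=0.999):
    """Classify the INT8 decoder from cross-source logit comparisons.

    fp32_cmp / int8_cmp: dicts with "cosine" (float) and "same_argmax" (bool),
    each comparing first-step logits for source A against source B.
    """
    fp32_sensitive = fp32_cmp["cosine"] < blind_cosine
    if not fp32_sensitive:
        # The FP32 reference itself does not separate the sources,
        # so this sentence pair cannot tell us anything.
        return "inconclusive"
    int8_flat = int8_cmp["cosine"] >= blind_cosine and int8_cmp["same_argmax"]
    return "source-blind" if int8_flat else "source-sensitive"
```

Running this over several unrelated source pairs is safer than trusting a single pair, since two sources can legitimately share a first target token.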


Solution 6: Use Quantization Debugger and selective rescue

Use the official Quantization Debugger to locate the first catastrophic tensor.

Relevant docs:

  • LiteRT Quantization Debugger
  • TensorFlow tf.lite.experimental.QuantizationDebugger

Start with decoder layer 0:

decoder embedding output
decoder self-attention output
cross-attention Q
cross-attention K
cross-attention V
QK^T attention scores
attention probabilities
attention_probs @ V context
cross-attention output projection
post-cross-attention residual
LM-head logits

Interpretation table:

Observation                      | Likely cause
K/V nearly constant              | Encoder memory destroyed
Q/K scales incompatible          | Dot product corrupted
Attention scores flat            | Source selection lost
Attention scores extreme         | Softmax collapse
Context vector near zero         | Cross-attention muted
Residual dominates context       | Source signal drowned
Logits same across source inputs | Decoder source-blind
Logits saturated                 | Output scale problem

Selective quantization is useful diagnostically:

leave K/V projections float
leave QK^T score path float
leave Softmax path float
leave cross-attention output projection float
leave post-cross-attention residual float
leave LM head float

If leaving a region float restores BLEU, that region is most likely where quantization breaks the model.

This may not be deployable on a strict INT8 delegate, but it tells you what must be fixed.
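
Once the debugger has written its per-layer statistics to CSV, a quick way to find the first bad tensor is to rank rows by RMSE divided by scale. The sketch below uses only the standard library and assumes the dump contains `tensor_name`, `mean_squared_error`, and `scale` columns; check the header of your own file, since I am going from the debugger tutorial rather than a guaranteed schema. For pure rounding error the ratio sits near 1/sqrt(12) ≈ 0.29, so the 0.7 cut-off here is just an assumed starting point.

```python
import csv
import io
import math

def suspect_tensors(csv_text, threshold=0.7):
    """Return (tensor_name, rmse/scale) for rows above the threshold, in file order."""
    suspects = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        scale = float(row["scale"])
        if scale <= 0.0:
            continue  # skip rows without a usable scale
        rmse_over_scale = math.sqrt(float(row["mean_squared_error"])) / scale
        if rmse_over_scale > threshold:
            suspects.append((row["tensor_name"], rmse_over_scale))
    return suspects
```

Because results keep the order of the CSV, the first entry is usually the most interesting one: everything after it may just be inheriting the damage.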


Solution 7: Decoder-step QAT

If explicit decoder PTQ still fails, QAT is the next real TFLite-native option.

Relevant docs:

  • TensorFlow Model Optimization QAT guide
  • TensorFlow comprehensive QAT guide
  • TensorFlow Model Optimization QuantizeConfig

Do not begin with the fused generation graph.

Begin with:

decoder_step_qat_model

Inputs:

decoder_input_ids
decoder_attention_mask
encoder_hidden_states
encoder_attention_mask

Target:

next target token

Training objective:

teacher-forced next-token prediction

Prefix sampling:

BOS
BOS + token 1
BOS + tokens 1..3
middle prefix
near-EOS prefix
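
A minimal sketch of how such teacher-forced examples could be cut from one target sequence is below. The function name is hypothetical, and it assumes `target_ids` already ends with the EOS id and does not include the decoder start token.

```python
def teacher_forced_examples(target_ids, decoder_start_token_id, prefix_lens):
    """Build (decoder_prefix, next_token) pairs for decoder-step training.

    A prefix of length n is the start token followed by the first n - 1
    target tokens; the label is the token that follows it.
    """
    full = [decoder_start_token_id] + list(target_ids)
    examples = []
    for n in prefix_lens:
        if 1 <= n < len(full):
            examples.append((full[:n], full[n]))
    return examples
```

Each pair still has to be padded to the static decoder length and combined with the encoder states for the same source before it is fed to the model.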

The QAT graph must match deployment:

same source length
same decoder prefix length
same masks
same decoder_start_token_id behavior
same encoder_hidden_states boundary
same logits output convention
same supported operator set

Important caveat:

Transformer attention QAT in TensorFlow/TFLite is not necessarily turnkey.

There are public issues around QAT support for MultiHeadAttention, which is a warning that you may need a custom Keras decoder-step implementation, custom QuantizeConfig, or manual fake-quant insertion.

Relevant issue:

  • TensorFlow Model Optimization: QAT support for MultiHeadAttention

Possible implementation routes:

custom decoder-step Keras model
custom QuantizeConfig
manual FakeQuant insertion
rewrite attention into quantizable primitives
train a smaller deployment-specific decoder

Solution 8: Custom delegate or custom op for quantized cross-attention

If you own the hardware delegate, the most robust engineering solution may be to stop relying on generic TFLite decomposition for attention.

Implement quantized cross-attention as a delegate-supported fused subgraph or custom op.

A real quantized cross-attention implementation needs to control:

Q projection scale
K projection scale
V projection scale
QK^T accumulation scale
mask representation
Softmax approximation range
attention_probs scale
attention_probs @ V accumulation
context output scale
output projection scale
residual merge scale

This is much harder than “support INT8 matmul.”

Attention contains:

FULLY_CONNECTED
RESHAPE
TRANSPOSE
BATCH_MATMUL
ADD / mask
SOFTMAX
BATCH_MATMUL
FULLY_CONNECTED
ADD / residual
possibly LayerNorm-adjacent behavior

Relevant public warnings:

  • TFLite quantized MultiHeadAttention issue
  • TFLite Micro quantized Softmax zero-point issue
  • LiteRT Torch LayerNorm full-INT8 issue

If a hardware vendor says “we support INT8 matrix multiplication,” that is not enough. Cross-attention requires correct scale propagation through the whole attention block.
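
A small worked example of what "scale propagation" means for just the QK^T step: the int32 accumulator of an int8 dot product carries the scale s_q × s_k, and it has to be rescaled by s_q × s_k / s_out before it fits the int8 output. The sketch below is plain Python with names of my own choosing; real kernels use fixed-point multipliers instead of float math.

```python
def quantize(xs, scale, zero_point=0):
    """Quantize floats to int8 with the given scale and zero point."""
    return [max(-128, min(127, round(x / scale) + zero_point)) for x in xs]

def int8_dot_requant(q_q, q_k, s_q, zp_q, s_k, zp_k, s_out, zp_out):
    """Integer dot product followed by requantization to the output scale."""
    # Integer accumulator; its real-valued meaning is acc * s_q * s_k.
    acc = sum((a - zp_q) * (b - zp_k) for a, b in zip(q_q, q_k))
    real_value = acc * s_q * s_k
    q_out = round(acc * (s_q * s_k / s_out)) + zp_out
    return acc, real_value, max(-128, min(127, q_out))
```

With q = [1.0, -0.5] and k = [0.5, 0.25] at scale 1/64, the accumulator is 1536 and the recovered score is 0.375, matching the float dot product. Choosing an output scale of 1/32 stores it as 12, while an output scale of 1/1024 saturates at 127. That saturation is the kind of scale error that turns into extreme or flat attention scores.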


Solution 9: Allow a precision exception if possible

If product constraints can change, the most natural accuracy fix is:

INT8 weights
+
INT16 or float activations for attention-sensitive paths

LiteRT documents a 16x8 mode:

  • LiteRT 16x8 post-training integer quantization

This can help when activations are sensitive to quantization, but runtime/delegate support is often limited.

If 16x8 improves quality but fails due to TILE or another unsupported op, the diagnostic meaning is still useful:

The model probably needs more activation precision.
The current delegate cannot execute the more accurate path.

Possible compromise:

INT8 encoder
INT8 FFN/projections
INT16 or float cross-attention score path
INT8 output projection

This is not pure full-INT8, but it is often closer to what Transformer quantization actually needs.
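
To see why extra activation bits matter, here is a toy comparison of 8-bit and 16-bit quantization of a small value inside a wide activation range. It is a simplification: symmetric quantization with zero point 0 for both widths, and a made-up range of ±64 standing in for an outlier-heavy attention tensor.

```python
def fake_quant(x, max_abs, bits):
    """Quantize x symmetrically to the given bit width and dequantize it again."""
    qmax = 2 ** (bits - 1) - 1
    scale = max_abs / qmax
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale

small_value = 0.1
as_int8 = fake_quant(small_value, 64.0, 8)    # step is about 0.5, so 0.1 rounds to 0.0
as_int16 = fake_quant(small_value, 64.0, 16)  # step is about 0.002, so 0.1 survives
```

When a few outliers stretch the range, 8-bit activations can wipe out the small values that carry most of the signal, while 16-bit activations keep them.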


Solution 10: Distill or redesign the model for the target

If full-INT8 TFLite is absolutely mandatory and QAT/custom delegate work is too expensive, the best product path may be to change the model.

Options:

smaller encoder-decoder Transformer
fewer decoder layers
smaller hidden size
shorter max source length
fixed decoder-step window
reduced vocabulary
domain-specific translation model
non-autoregressive model if task allows
RNN/Conv seq2seq model if task allows

Train with deployment constraints from the beginning:

static shapes
teacher-forced decoder-step training
QAT during fine-tuning
delegate-supported ops only
fixed source length
fixed decoder step shape

This is less elegant, but often more robust than trying to force a general-purpose pretrained Transformer decoder into a strict embedded INT8 delegate.


Solution 11: Change runtime if allowed

If TFLite is negotiable, use a Transformer-native runtime.

CTranslate2

CTranslate2 supports many encoder-decoder Transformer families and multiple quantization modes.

Useful links:

  • CTranslate2 GitHub
  • CTranslate2 Transformers guide
  • CTranslate2 quantization

This is the easiest way to answer:

Can this model family be quantized usefully at all?

If CTranslate2 INT8 works while TFLite INT8 fails, then the model is not inherently unquantizable. The TFLite path is the issue.

ONNX Runtime

ONNX Runtime has a more mature Transformer quantization story than TFLite for many workloads.

Useful links:

  • ONNX Runtime quantization docs
  • Optimum ONNX export guide
  • Optimum ONNX export functions

Important caveat:

ONNX Runtime success does not prove full-INT8 TFLite will work.

ONNX Runtime docs generally recommend dynamic quantization for Transformer-based models, while your target requires static full-INT8 behavior. Those are different deployment regimes.


Recommended execution plan

If TFLite is mandatory, I would do this in order.

Step 1: Build split FP32 TFLite

Create:

encoder_fp32.tflite
decoder_step_fp32.tflite

Verify:

split FP32 output ≈ original Transformers output

Do not quantize until this works.
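
A simple way to make "≈" checkable is to compare the logits of one decoder step from the split FP32 models against the original model. The sketch below works on flat lists of floats; the function name and the 1e-3 tolerance are assumptions, so pick a tolerance that fits your model's logit range.

```python
def compare_outputs(ref_logits, test_logits, atol=1e-3):
    """Compare one step of logits from the reference and the split model."""
    if len(ref_logits) != len(test_logits):
        raise ValueError("logit vectors must have the same length")
    max_abs_diff = max(abs(a - b) for a, b in zip(ref_logits, test_logits))
    ref_argmax = max(range(len(ref_logits)), key=ref_logits.__getitem__)
    test_argmax = max(range(len(test_logits)), key=test_logits.__getitem__)
    same_argmax = ref_argmax == test_argmax
    return {
        "max_abs_diff": max_abs_diff,
        "same_argmax": same_argmax,
        "ok": max_abs_diff <= atol and same_argmax,
    }
```

Checking several decoder steps and several sources this way catches export mistakes (wrong masks, wrong start token, swapped inputs) before any quantization noise is added on top.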


Step 2: Quantize encoder only

Create:

encoder_int8.tflite
decoder_step_fp32.tflite

If quality remains good, the encoder is not the blocker.


Step 3: Quantize decoder with decoder-specific calibration

Create:

decoder_step_int8.tflite

Use representative samples with:

decoder_input_ids
decoder_attention_mask
encoder_hidden_states
encoder_attention_mask

Test:

FP32 encoder + INT8 decoder
INT8 encoder + INT8 decoder

Step 4: Test source sensitivity

Compare first-step logits for two unrelated source sentences.

If INT8 logits are nearly identical, the decoder is source-blind.


Step 5: Debug cross-attention tensors

Use Quantization Debugger around:

Q
K
V
QK^T
Softmax
context
residual
LM head

Find the first catastrophic divergence.


Step 6: Apply one targeted rescue

Failure location        | Targeted fix
Encoder output boundary | Explicit requantization bridge
K/V projections         | Move K/V projection to encoder side
QK score path           | Custom scale handling or higher precision
Softmax                 | Custom op/delegate or precision exception
Residual merge          | QAT or scale control
LM head                 | Better calibration or QAT

Step 7: Try decoder-step QAT

Use teacher-forced target prefixes.

Do not start with the fused model.


Step 8: Validate CPU INT8 before delegate

If CPU INT8 fails, the model is still quantization-broken.

If CPU INT8 works and delegate fails, the problem is delegate support.


What not to do

Do not keep iterating on fused PTQ as the main path.

Do not add only more encoder-side calibration data.

Do not assume inference_output_type=tf.float32 is the root cause.

Do not assume ONNX/CTranslate2 success transfers directly to TFLite.

Do not attempt QAT on the high-level fused Hugging Face model first.

Do not debug the custom delegate until CPU INT8 is correct.


Practical answer

If the question is:

Is there a solution?

The honest answer is:

Yes, in principle. But not as a turnkey TFLiteConverter PTQ workflow.

The most plausible TFLite-native solution is:

1. split encoder and decoder_step
2. calibrate decoder_step explicitly with real decoder prefixes
3. handle encoder→decoder requantization
4. use first-step source-sensitivity tests
5. use Quantization Debugger around cross-attention
6. use decoder-step QAT if PTQ fails
7. add custom delegate support only after CPU INT8 works

The most plausible non-TFLite solution is:

CTranslate2 or ONNX Runtime

The most robust product solution, if TFLite full INT8 is mandatory and QAT still fails, is:

distill or redesign the model for the delegate

Short summary

  • There is probably no simple converter flag that fixes this.
  • Fused full-INT8 PTQ is probably a dead end for this model class.
  • The first real solution is encoder.tflite + decoder_step.tflite.
  • The decoder needs representative calibration with real decoder prefixes.
  • The encoder→decoder quantized boundary must be handled explicitly.
  • Cross-attention K/V may need to move to the encoder side.
  • Use Quantization Debugger to locate the first bad tensor.
  • Decoder-step QAT is the next realistic TFLite-native path.
  • A custom attention delegate may be required for strict embedded INT8.
  • If runtime constraints can change, CTranslate2 or ONNX Runtime is far more mature.
  • If constraints cannot change, distillation/redesign may be the most reliable product path.
