PTQ INT8 via TFLiteConverter — encoder-decoder seq2seq model loses encoder context entirely after conversion
I can’t find a single real-world example of this working “as-is” through a search…
Are there any real solutions for full-INT8 TFLite seq2seq Transformer deployment?
Short answer: yes, but not as a simple TFLiteConverter flag.
For a Hugging Face-style encoder-decoder Transformer such as T5, MarianMT, BART, mBART, Pegasus, M2M100, or NLLB, the realistic solution is not:
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
and done.
That path can produce a valid .tflite file while the decoder becomes numerically useless. The likely reason is that the decoder’s cross-attention path is not being calibrated correctly. The encoder can quantize cleanly, while the decoder loses source conditioning and starts producing repeated tokens, random tokens, empty strings, or nonsensical translations.
The most realistic path is:
encoder_int8.tflite
+
decoder_step_int8.tflite
+
host-side generation loop
+
explicit decoder calibration
+
explicit encoder→decoder quantized boundary handling
+
Quantization Debugger
+
possibly decoder-step QAT or a custom attention delegate
There is probably no public turnkey recipe for this exact target today.
Current state of the field
I would summarize the situation like this:
Full-INT8 TFLite deployment of a Hugging Face-style encoder-decoder Transformer decoder is not a mature public path. There are good public resources for TFLite INT8 in general, good public resources for ONNX/CTranslate2 seq2seq deployment, and good research on Transformer quantization. But I could not find a validated public example of a T5/MarianMT/BART-style encoder.tflite + decoder_step.tflite full-INT8 deployment with working decoder cross-attention and custom delegate execution.
Useful references:
- LiteRT / TFLite post-training integer quantization
- LiteRT post-training quantization overview
- LiteRT Quantization Debugger
- LiteRT 8-bit quantization specification
- Optimum ONNX encoder-decoder export guide
- Optimum TFLite export guide
- ONNX Runtime quantization docs
- TensorFlow Model Optimization QAT guide
- TensorFlow Model Optimization comprehensive QAT guide
- CTranslate2
- CTranslate2 quantization docs
Why the fused PTQ path fails
A fused seq2seq graph hides too much.
A seq2seq Transformer naturally runs like this:
1. Run encoder once.
2. Repeatedly run decoder for each generated token.
3. Select the next token outside the model.
4. Stop on EOS or max length.
The decoder uses the source through cross-attention:
decoder hidden state → Q
encoder hidden state → K, V
attention_scores = Q @ K.T
attention_probs = softmax(attention_scores + mask)
context = attention_probs @ V
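To make the data flow concrete, here is a minimal single-head sketch of that computation in plain Python. The function names and the optional `scale` argument are my own for illustration (most architectures use 1/sqrt(d) there; the formulas above leave it out, so it defaults to 1.0). It is not how any runtime implements attention.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(q, keys, values, mask, scale=1.0):
    """q: [d] decoder query; keys/values: [src_len][d] from the encoder;
    mask: [src_len] with 1 = real token, 0 = padding."""
    scores = []
    for k, keep in zip(keys, mask):
        s = scale * sum(qi * ki for qi, ki in zip(q, k))
        scores.append(s if keep else -1e9)  # masked positions get ~zero probability
    probs = softmax(scores)
    dim = len(values[0])
    context = [sum(p * v[j] for p, v in zip(probs, values)) for j in range(dim)]
    return probs, context

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
probs, context = cross_attention(q, keys, values, mask=[1, 1])
```

Everything the decoder knows about the source arrives through `keys` and `values`. If quantization flattens them toward a constant, `context` stops depending on the source no matter what the query is.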
If INT8 quantization corrupts this path, the decoder can still emit tokens because it still has:
- decoder token embeddings,
- decoder self-attention,
- learned language-model priors,
- LM-head bias,
- forced BOS/language-token priors.
But it no longer receives useful source information. That produces:
same-ish output for unrelated sources
repeated tokens
empty strings
random tokens
nonsensical translations
BLEU collapse
That is not ordinary quantization loss. That is source-conditioning failure.
Why encoder-only calibration is insufficient
TFLite full integer quantization depends on representative data to calibrate activation ranges.
For a decoder, representative data must cover the decoder state distribution, not only source inputs.
Bad calibration:
def representative_dataset():
    for source in sources:
        yield {
            "input_ids": source["input_ids"],
            "attention_mask": source["attention_mask"],
        }
That mostly calibrates the encoder path.
A decoder needs calibration samples like:
decoder_input_ids
decoder_attention_mask
encoder_hidden_states
encoder_attention_mask
and those samples must represent real generation states:
BOS-only prefix
early target prefix
middle target prefix
near-EOS prefix
short source
long source
padding-heavy source
near-no-padding source
names / numbers / rare tokens
domain-specific examples
A better calibration strategy is:
200 source examples × 5 decoder prefixes
not:
1000 source examples × encoder only
The issue is not just dataset size. It is whether the decoder cross-attention tensors are ever exercised with realistic activation ranges.
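One way to get that prefix coverage is to pick a handful of prefix lengths per target sentence, spread from BOS-only to near-EOS. The sketch below is only an illustration: the helper names are hypothetical, and I am assuming `target_ids` ends with EOS and that EOS is never fed back as a decoder input.

```python
def sample_prefix_lengths(seq_len, max_prefixes=5):
    """Pick prefix lengths covering BOS-only, early, middle and near-EOS states."""
    if seq_len < 1:
        return []
    candidates = {1, 2, seq_len // 2, seq_len - 1, seq_len}
    lengths = sorted(n for n in candidates if 1 <= n <= seq_len)
    return lengths[:max_prefixes]

def build_decoder_prefixes(decoder_start_token_id, target_ids):
    """Return teacher-forced decoder input prefixes for one target sentence.
    Assumes target_ids ends with EOS, which is dropped from the inputs."""
    decoder_inputs = [decoder_start_token_id] + list(target_ids[:-1])
    return [decoder_inputs[:n] for n in sample_prefix_lengths(len(decoder_inputs))]

prefixes = build_decoder_prefixes(0, [5, 6, 7, 1])
```

Each prefix would then be padded to the static decoder length and paired with the encoder states for the same source sentence.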
Solution 1: Split the graph
The first serious solution is to stop trying to deploy the fused graph.
Do not make this the production target:
fused_seq2seq_int8.tflite
Use this instead:
encoder_int8.tflite
decoder_step_int8.tflite
host_generation_loop
This matches the architecture used by mature seq2seq export flows. Hugging Face Optimum’s ONNX path explicitly handles encoder-decoder generation by separating encoder and decoder behavior, including decoder past-key-value reuse for autoregressive generation:
- Optimum ONNX export guide
- Optimum ONNX export functions
Target layout:
encoder_int8.tflite
inputs:
input_ids: int32
attention_mask: int32
outputs:
encoder_hidden_states: int8
decoder_step_int8.tflite
inputs:
decoder_input_ids: int32
decoder_attention_mask: int32
encoder_hidden_states: int8
encoder_attention_mask: int32
outputs:
logits: int8
Host-side generation:
encoder_states = run_encoder(input_ids, attention_mask)
decoder_ids = [decoder_start_token_id]
for step in range(max_new_tokens):
    logits = run_decoder_step(
        decoder_input_ids=decoder_ids,
        decoder_attention_mask=make_decoder_mask(decoder_ids),
        encoder_hidden_states=encoder_states,
        encoder_attention_mask=attention_mask,
    )
    next_id = select_next_token(logits)
    decoder_ids.append(next_id)
    if next_id == eos_token_id:
        break
This does not automatically fix quantization, but it makes the problem debuggable.
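Since the decoder emits int8 logits in this layout, token selection has to deal with quantized values. Below is a small greedy-selection sketch under my own assumptions: `q_logits` is the int8 logit row for the last decoder position, and `scale` / `zero_point` are the output tensor's quantization parameters (with the TFLite Python interpreter these are normally available from the `"quantization"` entry of `get_output_details()`). The function name is hypothetical.

```python
def dequantize(q_values, scale, zero_point):
    # real_value = scale * (q - zero_point)
    return [scale * (q - zero_point) for q in q_values]

def greedy_from_int8_logits(q_logits, scale, zero_point):
    """Return (token_id, saturated_count) for one int8 logit row."""
    logits = dequantize(q_logits, scale, zero_point)
    token_id = max(range(len(logits)), key=logits.__getitem__)
    # Many values pinned at -128 or 127 hint at an output scale problem.
    saturated = sum(1 for q in q_logits if q in (-128, 127))
    return token_id, saturated

token_id, saturated = greedy_from_int8_logits([-20, 3, 90, -128], scale=0.1, zero_point=-5)
```

For pure greedy decoding the argmax is the same on raw int8 values because the scale is positive, but dequantizing keeps the helper reusable for sampling or beam scoring, and the saturation count is a cheap health check.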
Solution 2: Build decoder-specific representative data
The decoder representative dataset must feed the decoder signature directly.
Conceptual decoder calibration:
def representative_decoder_dataset():
    for src_text, tgt_text in calibration_pairs:
        encoder_inputs = tokenize_source(src_text)
        # For debugging:
        #   Use FP32 encoder states.
        #
        # For deployment fidelity:
        #   Use quantized encoder states plus the real encoder→decoder requantization bridge.
        encoder_hidden_states = run_encoder_for_calibration(encoder_inputs)
        target_ids = tokenize_target(tgt_text)
        for prefix_len in [1, 2, 4, 8, 16, 32]:
            if prefix_len > len(target_ids):
                continue
            prefix = target_ids[:prefix_len]
            prefix = pad_to_static_length(prefix, DECODER_LEN)
            yield {
                "decoder_input_ids": prefix.astype("int32"),
                "decoder_attention_mask": make_decoder_mask(prefix).astype("int32"),
                "encoder_hidden_states": encoder_hidden_states,
                "encoder_attention_mask": encoder_inputs["attention_mask"].astype("int32"),
            }
If your SavedModel has multiple signatures, the representative dataset can conceptually be split by signature:
def representative_dataset():
    for batch in encoder_calibration_batches:
        yield (
            "encode",
            {
                "input_ids": batch["input_ids"],
                "attention_mask": batch["attention_mask"],
            },
        )
    for batch in decoder_calibration_batches:
        yield (
            "decode",
            {
                "decoder_input_ids": batch["decoder_input_ids"],
                "decoder_attention_mask": batch["decoder_attention_mask"],
                "encoder_hidden_states": batch["encoder_hidden_states"],
                "encoder_attention_mask": batch["encoder_attention_mask"],
            },
        )
Relevant docs:
- LiteRT post-training quantization
- LiteRT post-training integer quantization
The key idea:
The decoder must be calibrated as a decoder, not as a side effect of encoder input calibration.
Solution 3: Handle the encoder→decoder quantized boundary
If the encoder and decoder are separate TFLite models, the boundary can break the model even if both models are individually valid.
The encoder output and decoder input may have different quantization parameters:
encoder output:
scale_e
zero_point_e
decoder encoder_hidden_states input:
scale_d
zero_point_d
You cannot blindly pass raw int8 bytes from the encoder output into the decoder input unless the quantization parameters match.
If they differ, requantize:
real_value = scale_e * (q_e - zero_point_e)
q_d = round(real_value / scale_d + zero_point_d)
q_d = clamp(q_d, -128, 127)
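The three lines above translate almost directly into a host-side bridge. This is a plain-Python sketch with a hypothetical function name. I use Python's built-in `round`, which rounds halves to even; whether that matches the rounding mode of your runtime or delegate is an assumption you should check.

```python
def requantize(q_e, scale_e, zero_point_e, scale_d, zero_point_d):
    """Map int8 encoder outputs onto the decoder input's quantization grid."""
    out = []
    for q in q_e:
        real_value = scale_e * (q - zero_point_e)
        q_d = round(real_value / scale_d + zero_point_d)
        out.append(max(-128, min(127, q_d)))  # clamp to the int8 range
    return out

# Encoder grid is coarser (0.5) than the decoder grid (0.25), so values double
# and the extremes saturate.
bridged = requantize([0, 10, -128, 127], 0.5, 0, 0.25, 0)
```

Note how the last two values clamp: if the decoder input scale is much smaller than the encoder output scale, a large part of the encoder memory saturates at the boundary, which is one way the decoder ends up with meaningless encoder states.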
Deployment-style boundary test matrix:
| Encoder | Boundary | Decoder | Meaning |
|---|---|---|---|
| FP32 | float | FP32 | Split-graph reference |
| INT8 | dequantized float | FP32 | Tests encoder quality |
| FP32 | quantized to decoder input | INT8 | Tests decoder quality |
| INT8 | requantized | INT8 | Full deployment-like path |
If this boundary is wrong, the symptom can look exactly like broken cross-attention:
decoder runs
but receives meaningless encoder memory
Solution 4: Move cross-attention K/V projection to the encoder side
This is an architecture-level workaround.
Normally, each decoder layer computes K/V from encoder hidden states:
K_i = W_k_i(encoder_hidden_states)
V_i = W_v_i(encoder_hidden_states)
Instead, make the encoder-side artifact produce precomputed cross-attention memory:
encoder_int8.tflite
outputs:
cross_k_layer_0
cross_v_layer_0
cross_k_layer_1
cross_v_layer_1
...
Then make the decoder consume those tensors directly:
decoder_step_int8.tflite
inputs:
decoder_input_ids
decoder_attention_mask
cross_k_layer_0
cross_v_layer_0
cross_k_layer_1
cross_v_layer_1
...
Why this can help:
- K/V are computed once, not every decoder step.
- K/V become explicit graph outputs/inputs.
- You can inspect their quantization parameters directly.
- You can design the encoder→decoder boundary around K/V instead of generic hidden states.
- The decoder graph becomes more predictable.
Tradeoff:
num_decoder_layers × 2 tensors
You get more interface complexity, but also much more control.
This is one of the most promising workarounds if the failure is specifically cross-attention K/V scale mismatch.
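To get a feel for the interface cost, here is a back-of-the-envelope helper. The function is hypothetical, the tensor names just follow the layout above, and the example dimensions are made up rather than taken from any particular model.

```python
def cross_kv_interface(num_decoder_layers, src_len, num_heads, head_dim, bytes_per_value=1):
    """List the extra encoder outputs / decoder inputs and their total size in bytes."""
    names = []
    for i in range(num_decoder_layers):
        names += [f"cross_k_layer_{i}", f"cross_v_layer_{i}"]
    bytes_per_tensor = src_len * num_heads * head_dim * bytes_per_value
    return names, bytes_per_tensor * len(names)

# Illustrative numbers only: 6 decoder layers, 128 source tokens, 8 heads of size 64, int8.
names, total_bytes = cross_kv_interface(6, 128, 8, 64)
```

With those example numbers that is 12 tensors and 768 KiB of int8 memory passed across the boundary once per source sentence.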
Solution 5: Use first-step source-sensitivity testing
Before relying on BLEU, test whether the decoder still sees the source.
Use two unrelated inputs:
source A: The committee approved the budget after a long debate.
source B: The patient developed a fever after the second injection.
decoder prefix: decoder_start_token_id
Compare first-step logits:
FP32(source A, BOS) vs FP32(source B, BOS)
INT8(source A, BOS) vs INT8(source B, BOS)
Healthy behavior:
FP32 logits differ across sources.
INT8 logits also differ across sources.
Broken behavior:
FP32 logits differ across sources.
INT8 logits are nearly identical across sources.
Minimal helper:
import numpy as np


def topk_ids(logits, k=10):
    flat = np.asarray(logits).reshape(-1)
    return np.argsort(flat)[-k:][::-1]


def compare_logits(logits_a, logits_b, k=10):
    a = np.asarray(logits_a).reshape(-1).astype(np.float64)
    b = np.asarray(logits_b).reshape(-1).astype(np.float64)
    top_a = topk_ids(a, k)
    top_b = topk_ids(b, k)
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return {
        "argmax_a": int(top_a[0]),
        "argmax_b": int(top_b[0]),
        "same_argmax": bool(top_a[0] == top_b[0]),
        "topk_overlap": len(set(top_a.tolist()) & set(top_b.tolist())),
        "cosine": float(cosine),
        "range_a": float(a.max() - a.min()),
        "range_b": float(b.max() - b.min()),
        "top_a": top_a.tolist(),
        "top_b": top_b.tolist(),
    }
This test is more diagnostic than BLEU.
BLEU tells you output quality is bad. First-step source sensitivity tells you whether the decoder lost encoder context immediately.
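If you want to turn the healthy/broken comparison into an automatic check, something like the sketch below could work. The function name and the 0.999 cosine threshold are assumptions of mine, not calibrated values, so tune them against your own FP32 baseline.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def source_blind_verdict(fp32_a, fp32_b, int8_a, int8_b, blind_cosine=0.999):
    """Compare first-step logits for two unrelated sources, FP32 vs INT8."""
    reference = cosine(fp32_a, fp32_b)
    quantized = cosine(int8_a, int8_b)
    if reference >= blind_cosine:
        # The FP32 model itself barely separates these sources; pick other inputs.
        return "inconclusive"
    return "source-blind" if quantized >= blind_cosine else "source-sensitive"

verdict = source_blind_verdict([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [5, 5, 4], [5, 5, 4])
```

Pass dequantized (or at least same-scale) INT8 logits for both sources, and run it over several source pairs rather than trusting a single one.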
Solution 6: Use Quantization Debugger and selective rescue
Use the official Quantization Debugger to locate the first catastrophic tensor.
Relevant docs:
- LiteRT Quantization Debugger
- TensorFlow tf.lite.experimental.QuantizationDebugger
Start with decoder layer 0:
decoder embedding output
decoder self-attention output
cross-attention Q
cross-attention K
cross-attention V
QK^T attention scores
attention probabilities
attention_probs @ V context
cross-attention output projection
post-cross-attention residual
LM-head logits
Interpretation table:
| Observation | Likely cause |
|---|---|
| K/V nearly constant | Encoder memory destroyed |
| Q/K scales incompatible | Dot product corrupted |
| Attention scores flat | Source selection lost |
| Attention scores extreme | Softmax collapse |
| Context vector near zero | Cross-attention muted |
| Residual dominates context | Source signal drowned |
| Logits same across source inputs | Decoder source-blind |
| Logits saturated | Output scale problem |
Selective quantization is useful diagnostically:
leave K/V projections float
leave QK^T score path float
leave Softmax path float
leave cross-attention output projection float
leave post-cross-attention residual float
leave LM head float
If leaving a region float restores BLEU, that region is the failure point.
This may not be deployable on a strict INT8 delegate, but it tells you what must be fixed.
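Once the debugger has dumped its per-layer statistics to CSV, a small script can rank the suspects. The sketch below assumes the dump has `op_name`, `tensor_name`, `scale`, and `mean_squared_error` columns, which is my reading of the Quantization Debugger docs, so verify that against your actual file. The tensor names in the sample are invented.

```python
import csv
import io
import math

def rank_suspect_layers(csv_text, top_n=5):
    """Rank tensors by RMSE / scale, highest (most suspicious) first."""
    ranked = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        scale = float(row["scale"])
        if scale <= 0:
            continue  # skip rows without usable quantization parameters
        rmse_over_scale = math.sqrt(float(row["mean_squared_error"])) / scale
        ranked.append((rmse_over_scale, row["tensor_name"], row["op_name"]))
    ranked.sort(reverse=True)
    return ranked[:top_n]

# Hypothetical sample rows; real dumps contain more columns and many more tensors.
sample = (
    "op_name,tensor_name,scale,mean_squared_error\n"
    "FULLY_CONNECTED,dec/l0/cross/k,0.5,0.01\n"
    "SOFTMAX,dec/l0/cross/probs,0.00390625,0.0009\n"
    "ADD,dec/l0/residual,0.25,0.0625\n"
)
suspects = rank_suspect_layers(sample)
```

As far as I recall, the debugger guide treats an RMSE / scale near 1/sqrt(12) (about 0.289) as roughly what a well-behaved quantized tensor looks like, so tensors far above that are the first candidates to leave in float for the diagnostic runs listed above.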
Solution 7: Decoder-step QAT
If explicit decoder PTQ still fails, QAT is the next real TFLite-native option.
Relevant docs:
- TensorFlow Model Optimization QAT guide
- TensorFlow comprehensive QAT guide
- TensorFlow Model Optimization QuantizeConfig
Do not begin with the fused generation graph.
Begin with:
decoder_step_qat_model
Inputs:
decoder_input_ids
decoder_attention_mask
encoder_hidden_states
encoder_attention_mask
Target:
next target token
Training objective:
teacher-forced next-token prediction
Prefix sampling:
BOS
BOS + token 1
BOS + tokens 1..3
middle prefix
near-EOS prefix
The QAT graph must match deployment:
same source length
same decoder prefix length
same masks
same decoder_start_token_id behavior
same encoder_hidden_states boundary
same logits output convention
same supported operator set
Important caveat:
Transformer attention QAT in TensorFlow/TFLite is not necessarily turnkey.
There are public issues around QAT support for MultiHeadAttention, which is a warning that you may need a custom Keras decoder-step implementation, custom QuantizeConfig, or manual fake-quant insertion.
Relevant issue:
- TensorFlow Model Optimization: QAT support for MultiHeadAttention
Possible implementation routes:
custom decoder-step Keras model
custom QuantizeConfig
manual FakeQuant insertion
rewrite attention into quantizable primitives
train a smaller deployment-specific decoder
Solution 8: Custom delegate or custom op for quantized cross-attention
If you own the hardware delegate, the most robust engineering solution may be to stop relying on generic TFLite decomposition for attention.
Implement quantized cross-attention as a delegate-supported fused subgraph or custom op.
A real quantized cross-attention implementation needs to control:
Q projection scale
K projection scale
V projection scale
QK^T accumulation scale
mask representation
Softmax approximation range
attention_probs scale
attention_probs @ V accumulation
context output scale
output projection scale
residual merge scale
This is much harder than “support INT8 matmul.”
Attention contains:
FULLY_CONNECTED
RESHAPE
TRANSPOSE
BATCH_MATMUL
ADD / mask
SOFTMAX
BATCH_MATMUL
FULLY_CONNECTED
ADD / residual
possibly LayerNorm-adjacent behavior
Relevant public warnings:
- TFLite quantized MultiHeadAttention issue
- TFLite Micro quantized Softmax zero-point issue
- LiteRT Torch LayerNorm full-INT8 issue
If a hardware vendor says “we support INT8 matrix multiplication,” that is not enough. Cross-attention requires correct scale propagation through the whole attention block.
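A tiny worked example of what "scale propagation" means for the QK^T step alone. I am assuming symmetric quantization (zero point 0) for Q and K to keep the arithmetic short. Real kernels use fixed-point multipliers instead of float, and all names here are hypothetical.

```python
def quantize(values, scale):
    # Symmetric int8 quantization (zero point assumed to be 0).
    return [max(-128, min(127, round(v / scale))) for v in values]

def int8_dot_requant(q_q, q_k, scale_q, scale_k, scale_out, zero_point_out=0):
    """Integer dot product followed by requantization to the score tensor's scale."""
    acc = sum(a * b for a, b in zip(q_q, q_k))   # wide accumulator, real = acc * scale_q * scale_k
    multiplier = (scale_q * scale_k) / scale_out
    q_out = round(acc * multiplier) + zero_point_out
    return acc, max(-128, min(127, q_out))

scale_q, scale_k = 1 / 64, 1 / 128
q_q = quantize([1.0, -2.0, 0.5], scale_q)    # [64, -128, 32]
q_k = quantize([0.5, 0.25, -1.0], scale_k)   # [64, 32, -128]

acc, score_ok = int8_dot_requant(q_q, q_k, scale_q, scale_k, scale_out=1 / 32)
_, score_bad = int8_dot_requant(q_q, q_k, scale_q, scale_k, scale_out=1 / 1024)
```

The real dot product is -0.5. With an output scale of 1/32 it maps to -16 and survives; with a badly calibrated output scale of 1/1024 it clamps at -128. That clamping is the kind of saturation that flattens or blows up attention scores before Softmax ever runs.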
Solution 9: Allow a precision exception if possible
If product constraints can change, the most natural accuracy fix is:
INT8 weights
+
INT16 or float activations for attention-sensitive paths
LiteRT documents a 16x8 mode:
- LiteRT 16x8 post-training integer quantization
This can help when activations are sensitive to quantization, but runtime/delegate support is often limited.
If 16x8 improves quality but fails due to TILE or another unsupported op, the diagnostic meaning is still useful:
The model probably needs more activation precision.
The current delegate cannot execute the more accurate path.
Possible compromise:
INT8 encoder
INT8 FFN/projections
INT16 or float cross-attention score path
INT8 output projection
This is not pure full-INT8, but it is often closer to what Transformer quantization actually needs.
Solution 10: Distill or redesign the model for the target
If full-INT8 TFLite is absolutely mandatory and QAT/custom delegate work is too expensive, the best product path may be to change the model.
Options:
smaller encoder-decoder Transformer
fewer decoder layers
smaller hidden size
shorter max source length
fixed decoder-step window
reduced vocabulary
domain-specific translation model
non-autoregressive model if task allows
RNN/Conv seq2seq model if task allows
Train with deployment constraints from the beginning:
static shapes
teacher-forced decoder-step training
QAT during fine-tuning
delegate-supported ops only
fixed source length
fixed decoder step shape
This is less elegant, but often more robust than trying to force a general-purpose pretrained Transformer decoder into a strict embedded INT8 delegate.
Solution 11: Change runtime if allowed
If TFLite is negotiable, use a Transformer-native runtime.
CTranslate2
CTranslate2 supports many encoder-decoder Transformer families and multiple quantization modes.
Useful links:
- CTranslate2 GitHub
- CTranslate2 Transformers guide
- CTranslate2 quantization
This is the easiest way to answer:
Can this model family be quantized usefully at all?
If CTranslate2 INT8 works while TFLite INT8 fails, then the model is not inherently unquantizable. The TFLite path is the issue.
ONNX Runtime
ONNX Runtime has a more mature Transformer quantization story than TFLite for many workloads.
Useful links:
- ONNX Runtime quantization docs
- Optimum ONNX export guide
- Optimum ONNX export functions
Important caveat:
ONNX Runtime success does not prove full-INT8 TFLite will work.
ONNX Runtime docs generally recommend dynamic quantization for Transformer-based models, while your target requires static full-INT8 behavior. Those are different deployment regimes.
Recommended execution plan
If TFLite is mandatory, I would do this in order.
Step 1: Build split FP32 TFLite
Create:
encoder_fp32.tflite
decoder_step_fp32.tflite
Verify:
split FP32 output ≈ original Transformers output
Do not quantize until this works.
Step 2: Quantize encoder only
Create:
encoder_int8.tflite
decoder_step_fp32.tflite
If quality remains good, the encoder is not the blocker.
Step 3: Quantize decoder with decoder-specific calibration
Create:
decoder_step_int8.tflite
Use representative samples with:
decoder_input_ids
decoder_attention_mask
encoder_hidden_states
encoder_attention_mask
Test:
FP32 encoder + INT8 decoder
INT8 encoder + INT8 decoder
Step 4: Test source sensitivity
Compare first-step logits for two unrelated source sentences.
If INT8 logits are nearly identical, the decoder is source-blind.
Step 5: Debug cross-attention tensors
Use Quantization Debugger around:
Q
K
V
QK^T
Softmax
context
residual
LM head
Find the first catastrophic divergence.
Step 6: Apply one targeted rescue
| Failure location | Targeted fix |
|---|---|
| Encoder output boundary | Explicit requantization bridge |
| K/V projections | Move K/V projection to encoder side |
| QK score path | Custom scale handling or higher precision |
| Softmax | Custom op/delegate or precision exception |
| Residual merge | QAT or scale control |
| LM head | Better calibration or QAT |
Step 7: Try decoder-step QAT
Use teacher-forced target prefixes.
Do not start with the fused model.
Step 8: Validate CPU INT8 before delegate
If CPU INT8 fails, the model is still quantization-broken.
If CPU INT8 works and delegate fails, the problem is delegate support.
What not to do
Do not keep iterating on fused PTQ as the main path.
Do not add only more encoder-side calibration data.
Do not assume inference_output_type=tf.float32 is the root cause.
Do not assume ONNX/CTranslate2 success transfers directly to TFLite.
Do not attempt QAT on the high-level fused Hugging Face model first.
Do not debug the custom delegate until CPU INT8 is correct.
Practical answer
If the question is:
Is there a solution?
The honest answer is:
Yes, in principle. But not as a turnkey TFLiteConverter PTQ workflow.
The most plausible TFLite-native solution is:
1. split encoder and decoder_step
2. calibrate decoder_step explicitly with real decoder prefixes
3. handle encoder→decoder requantization
4. use first-step source-sensitivity tests
5. use Quantization Debugger around cross-attention
6. use decoder-step QAT if PTQ fails
7. add custom delegate support only after CPU INT8 works
The most plausible non-TFLite solution is:
CTranslate2 or ONNX Runtime
The most robust product solution, if TFLite full INT8 is mandatory and QAT still fails, is:
distill or redesign the model for the delegate
Short summary
- There is probably no simple converter flag that fixes this.
- Fused full-INT8 PTQ is probably a dead end for this model class.
- The first real solution is encoder.tflite + decoder_step.tflite.
- The decoder needs representative calibration with real decoder prefixes.
- The encoder→decoder quantized boundary must be handled explicitly.
- Cross-attention K/V may need to move to the encoder side.
- Use Quantization Debugger to locate the first bad tensor.
- Decoder-step QAT is the next realistic TFLite-native path.
- A custom attention delegate may be required for strict embedded INT8.
- If runtime constraints can change, CTranslate2 or ONNX Runtime is far more mature.
- If constraints cannot change, distillation/redesign may be the most reliable product path.