Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibllqvf36ltkrbqltsjc3tmknct32qus4aefil2u5yi3g7azbat4a",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkkxcqfu65i2"
  },
  "path": "/t/ptq-int8-via-tfliteconverter-encoder-decoder-seq2seq-model-loses-encoder-context-entirely-after-conversion/175595#post_2",
  "publishedAt": "2026-04-28T15:22:37.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "LiteRT / TFLite post-training integer quantization",
    "TensorFlow tf.lite.RepresentativeDataset",
    "LiteRT 8-bit quantization specification",
    "LiteRT post-training integer quantization",
    "Quantization Debugger",
    "Optimum TFLite exporter overview",
    "TFLite export guide",
    "Optimum ONNX export docs",
    "quantization guide",
    "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models",
    "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale",
    "I-BERT: Integer-only BERT Quantization",
    "Outlier Suppression: Pushing the Limit of Low-bit Transformer Quantization",
    "ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers",
    "Optimum ONNX export guide",
    "LiteRT 16x8 post-training integer quantization",
    "LiteRT Quantization Debugger guide",
    "TensorFlow tf.lite.experimental.QuantizationDebugger",
    "TensorFlow Model Optimization: quantization-aware training",
    "TensorFlow post-training quantization overview",
    "CTranslate2",
    "CTranslate2 Transformers guide",
    "CTranslate2 quantization",
    "ONNX Runtime quantization"
  ],
  "textContent": "For now, it seems to be a known complicated failure mode:\n\n* * *\n\n# Answer\n\nYes — this is a known failure class, but I would phrase it carefully.\n\nI would **not** describe it as:\n\n> TFLiteConverter officially does not support encoder-decoder seq2seq PTQ.\n\nThat is too broad.\n\nA more accurate statement is:\n\n> **Full INT8 post-training quantization with TFLiteConverter is not a robust, well-documented deployment path for a fused autoregressive encoder-decoder Transformer graph. Conversion success only proves that the graph was lowered to a TFLite flatbuffer; it does not prove that encoder-decoder conditioning survived quantization.**\n\nIn this case, the symptoms are much stronger than ordinary quantization degradation:\n\n  * BLEU drops from `23.9` to `0.04`.\n  * The model emits repeated tokens for essentially any input.\n  * The decoder appears to ignore the encoder from the first decoding step.\n  * The `INT16 activations / INT8 weights` path is not deployable because the target runtime rejects `TILE`.\n\n\n\nThat combination strongly suggests that full INT8 PTQ has damaged the **encoder-memory / decoder cross-attention path**. The converted model is structurally valid, but semantically broken.\n\n* * *\n\n# Why conversion success is misleading\n\nTFLite conversion answers a graph-lowering question:\n\n> Can this TensorFlow graph be represented as a TFLite model using the requested operator set?\n\nIt does **not** answer the more important deployment question:\n\n> Does the quantized model preserve the numerical behavior required for autoregressive seq2seq generation?\n\nThe LiteRT/TFLite full-integer quantization path uses a representative dataset to estimate ranges for variable tensors such as model inputs, outputs, and intermediate activations. 
See:\n\n  * LiteRT / TFLite post-training integer quantization\n  * TensorFlow tf.lite.RepresentativeDataset\n  * LiteRT 8-bit quantization specification\n\n\n\nFor image classifiers, “representative data” often means representative images. For seq2seq generation, that is not enough. The representative dataset must exercise the real generation states:\n\n  * source token lengths,\n  * source attention masks,\n  * decoder prefix lengths,\n  * decoder masks,\n  * forced BOS / language tokens,\n  * early decoding,\n  * middle decoding,\n  * near-EOS decoding,\n  * cross-attention activation ranges,\n  * final-logit ranges.\n\n\n\nIf the converter only calibrates a narrow graph path, it can choose bad INT8 scales for tensors that are critical during real decoding.\n\nThat is how you can get:\n\n\n    conversion succeeds\n    +\n    runtime does not crash\n    +\n    outputs are completely wrong\n\n\n* * *\n\n# Why this looks like cross-attention failure\n\nAn encoder-decoder Transformer has two main parts:\n\n\n    source tokens\n    → encoder\n    → encoder hidden states\n    → decoder cross-attention\n    → decoder hidden states\n    → logits\n    → output tokens\n\n\nThe decoder uses the source input mainly through **cross-attention**. If that path is corrupted, the decoder still has:\n\n  * target-side token embeddings,\n  * decoder self-attention,\n  * learned language-model priors,\n  * LM-head bias,\n  * BOS / forced-token priors,\n  * common-token frequency bias.\n\n\n\nSo the model can still generate tokens. But the output becomes weakly conditioned or unconditioned. 
Typical symptoms are:\n\n  * nearly identical output for different inputs,\n  * repeated tokens,\n  * generic high-prior tokens,\n  * early collapse,\n  * near-zero BLEU,\n  * first-step logits that barely change across source inputs.\n\n\n\nThat matches the described behavior.\n\nThe most suspicious region is:\n\n\n    encoder_hidden_states\n    → cross-attention key/value projections\n    → attention score/value path\n    → cross-attention output projection\n    → residual / LayerNorm-adjacent tensors\n    → LM-head logits\n\n\nThe first decoding step is especially diagnostic. At step 1, the decoder has almost no target-side history. If the INT8 model is already source-insensitive at step 1, the problem is probably not beam search, repetition penalty, EOS handling, or long-run generation logic. It is likely the encoder-memory path or the first decoder cross-attention block.\n\n* * *\n\n# Is this a known limitation?\n\nIn practical terms, yes.\n\nThe exact sentence “TFLiteConverter PTQ does not support encoder-decoder seq2seq” is not the usual official wording. But the documented pieces line up:\n\n  * TFLite full-integer PTQ depends on representative activation calibration: LiteRT post-training integer quantization.\n  * TFLite provides a Quantization Debugger specifically because full-integer quantization can produce unexpectedly poor or completely wrong results.\n  * Hugging Face’s Optimum TFLite exporter overview lists mostly encoder-style architectures such as BERT, RoBERTa, DistilBERT, MobileBERT, MPNet, and related models. 
It does not present full autoregressive encoder-decoder generation as the obvious happy path.\n  * Optimum’s TFLite export guide notes that static input shapes need to be specified.\n  * Hugging Face’s Optimum ONNX export docs describe encoder-decoder export using separate encoder and decoder pieces, because the encoder runs once and the decoder runs repeatedly during autoregressive generation.\n  * ONNX Runtime’s quantization guide says dynamic quantization is generally recommended for RNNs and Transformer-based models, while static quantization is generally recommended for CNNs.\n\n\n\nThat last point is especially relevant. Your hardware requires a static full-INT8-style artifact, but Transformer generation is one of the model families where static activation calibration is most fragile.\n\nSo the practical answer is:\n\n> **This is a known class of PTQ failure: a valid full-INT8 TFLite model can be generated, but the quantized activations can destroy the conditioning path that makes encoder-decoder generation work.**\n\n* * *\n\n# Why Transformers are hard for generic INT8 PTQ\n\nTransformer quantization is difficult mainly because the activations are difficult.\n\nThe literature around Transformer quantization repeatedly points to activation outliers and attention/LayerNorm sensitivity:\n\n  * SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models explains that weights are relatively easy to quantize, while activations are harder because of outliers. 
SmoothQuant migrates quantization difficulty from activations to weights.\n  * LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale isolates outlier feature dimensions because they dominate Transformer behavior.\n  * I-BERT: Integer-only BERT Quantization shows that integer-only Transformer inference needs special handling for GELU, Softmax, and LayerNorm.\n  * Outlier Suppression: Pushing the Limit of Low-bit Transformer Quantization focuses on suppressing activation outliers in Transformer quantization.\n  * ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers combines PTQ, hardware-aware quantization, and layer-by-layer distillation.\n\n\n\nPlain `TFLiteConverter` PTQ is much more generic than these methods. It does not automatically perform SmoothQuant-style activation smoothing, LLM.int8-style outlier routing, or I-BERT-style integer Transformer operator redesign.\n\nThat matters because a fused encoder-decoder generation graph contains exactly the fragile pieces:\n\n\n    MatMul / BatchMatMul\n    Softmax\n    LayerNorm-adjacent tensors\n    residual additions\n    attention masks\n    cross-attention K/V projections\n    final vocabulary projection\n\n\nA single bad scale around cross-attention can make the decoder appear source-blind.\n\n* * *\n\n# Why the fused graph is probably the wrong deployment shape\n\nA fused encoder-decoder generation graph is the least debuggable shape for this problem.\n\nSeq2seq inference naturally looks like this:\n\n\n    1. Run encoder once.\n    2. Repeatedly run decoder for each output token.\n    3. Select the next token outside the model.\n    4. 
Stop on EOS or max length.\n\n\nThe usual deployment structure is therefore:\n\n\n    encoder model:\n      input_ids, attention_mask\n      → encoder_hidden_states\n\n    decoder-step model:\n      decoder_input_ids, decoder_attention_mask, encoder_hidden_states, encoder_attention_mask\n      → next-token logits\n\n\nThen the host application runs greedy search, beam search, EOS handling, and repetition logic outside the model.\n\nThis is also the shape used by common seq2seq export/deployment tooling. For example, Hugging Face’s Optimum ONNX export guide discusses decoder export with past key/value reuse because the decoder runs repeatedly during autoregressive generation.\n\nA fused graph often hides too much:\n\n\n    encoder\n    decoder\n    decoder loop\n    mask updates\n    shape operations\n    possibly beam expansion\n    possibly TILE\n    token selection\n    EOS handling\n\n\nThat makes all of these harder:\n\n  * calibration,\n  * static-shape control,\n  * operator support,\n  * delegate partitioning,\n  * cross-attention inspection,\n  * first-step source-sensitivity testing,\n  * quantized boundary debugging.\n\n\n\nFor this case, I would not keep pushing the fused graph as the primary production path.\n\n* * *\n\n# Recommended working approach\n\nThe most realistic path forward is:\n\n\n    encoder_int8.tflite\n    +\n    decoder_step_int8.tflite\n    +\n    host-side generation loop\n\n\nDo not export `generate()` as one fused TFLite graph unless there is no alternative.\n\n## Target layout\n\n\n    encoder.tflite\n\n    inputs:\n      input_ids: int32\n      attention_mask: int32\n\n    outputs:\n      encoder_hidden_states: int8\n\n\n\n    decoder_step.tflite\n\n    inputs:\n      decoder_input_ids: int32\n      decoder_attention_mask: int32\n      encoder_hidden_states: int8\n      encoder_attention_mask: int32\n\n    outputs:\n      logits: int8\n\n\nHost-side decoding:\n\n\n    encoder_states = run_encoder(input_ids, attention_mask)\n\n    
decoder_ids = [decoder_start_token_id]\n\n    for step in range(max_new_tokens):\n        logits = run_decoder_step(\n            decoder_input_ids=decoder_ids,\n            decoder_attention_mask=make_decoder_mask(decoder_ids),\n            encoder_hidden_states=encoder_states,\n            encoder_attention_mask=attention_mask,\n        )\n\n        next_id = select_next_token(logits)\n        decoder_ids.append(next_id)\n\n        if next_id == eos_token_id:\n            break\n\n\nThis structure gives you a way to test each boundary:\n\n\n    FP32 encoder → FP32 decoder\n    INT8 encoder → FP32 decoder\n    FP32 encoder → INT8 decoder\n    INT8 encoder → INT8 decoder\n    INT8 encoder → INT8 decoder on hardware delegate\n\n\nThat isolates whether the failure comes from:\n\n  * encoder quantization,\n  * decoder quantization,\n  * the encoder-output / decoder-input boundary,\n  * cross-attention,\n  * logits,\n  * or the delegate.\n\n\n\n* * *\n\n# The hard part: quantized encoder/decoder boundary\n\nIf you split the graph, the encoder output and decoder input may have different quantization parameters.\n\nExample:\n\n\n    encoder output:\n      scale_e\n      zero_point_e\n\n    decoder encoder_hidden_states input:\n      scale_d\n      zero_point_d\n\n\nYou cannot blindly pass raw INT8 bytes from the encoder output into the decoder input unless the quantization parameters match.\n\nIf they differ, you need an explicit requantization bridge:\n\n\n    real_value = scale_e * (q_e - zero_point_e)\n    q_d = round(real_value / scale_d + zero_point_d)\n    q_d = clamp(q_d, -128, 127)\n\n\nThis boundary is important. 
A broken boundary can produce exactly the same symptom as broken cross-attention: the decoder runs but receives meaningless encoder memory.\n\nFor debugging, temporarily test these variants:\n\n\n    FP32 encoder output → FP32 decoder\n    INT8 encoder output → dequantized float → FP32 decoder\n    FP32 encoder output → quantized decoder input → INT8 decoder\n    INT8 encoder output → requantized decoder input → INT8 decoder\n\n\nOnly the final variant is close to strict deployment, but the intermediate variants tell you where the information is lost.\n\n* * *\n\n# Calibration strategy\n\nThe representative dataset must cover actual generation states.\n\nDo not calibrate only source inputs.\n\nDo not calibrate only BOS.\n\nDo not calibrate only full teacher-forced targets if deployment uses step-by-step decoding.\n\nA better calibration set should include multiple decoder prefixes per source example.\n\n## Bad calibration pattern\n\n\n    def representative_dataset():\n        for batch in source_batches:\n            yield {\n                \"input_ids\": batch[\"input_ids\"],\n                \"attention_mask\": batch[\"attention_mask\"],\n            }\n\n\nThat may calibrate the encoder path but not the decoder cross-attention behavior used during generation.\n\n## Better calibration pattern\n\n\n    def representative_dataset():\n        for src_text, tgt_text in calibration_pairs:\n            src = source_tokenizer(\n                src_text,\n                max_length=SRC_LEN,\n                padding=\"max_length\",\n                truncation=True,\n                return_tensors=\"np\",\n            )\n\n            tgt = target_tokenizer(\n                tgt_text,\n                max_length=TGT_LEN,\n                padding=False,\n                truncation=True,\n                return_tensors=\"np\",\n            )\n\n            target_ids = tgt[\"input_ids\"][0]\n\n            for prefix_len in [1, 2, 4, 8, 16, 32]:\n                if 
prefix_len > len(target_ids):\n                    continue\n\n                decoder_prefix = target_ids[:prefix_len]\n                decoder_prefix = pad_to_length(\n                    decoder_prefix,\n                    length=DECODER_PREFIX_LEN,\n                    pad_id=target_pad_id,\n                )\n\n                yield {\n                    \"input_ids\": src[\"input_ids\"].astype(\"int32\"),\n                    \"attention_mask\": src[\"attention_mask\"].astype(\"int32\"),\n                    \"decoder_input_ids\": decoder_prefix[None, :].astype(\"int32\"),\n                    \"decoder_attention_mask\": (decoder_prefix[None, :] != target_pad_id).astype(\"int32\"),\n                }\n\n\nThe exact input names must match the SavedModel signature.\n\n## Calibration coverage checklist\n\nInclude:\n\n\n    short source examples\n    normal source examples\n    long source examples\n    max-length source examples\n    padding-heavy examples\n    near-no-padding examples\n    rare names and numerals\n    punctuation-heavy examples\n    domain-specific examples\n    BOS / forced decoder-start token\n    early decoder prefix\n    middle decoder prefix\n    near-EOS decoder prefix\n\n\nA useful rule of thumb:\n\n\n    200 source examples × 5 decoder prefixes\n\n\nis usually more informative than:\n\n\n    1000 source examples × only BOS\n\n\nbecause the former covers more activation regimes.\n\n* * *\n\n# Converter configuration advice\n\nThere is probably no single converter flag that fixes this.\n\nStill, I would run these baselines.\n\n## 1. Float TFLite baseline\n\n\n    converter = tf.lite.TFLiteConverter.from_saved_model(\"<saved_model_dir>\")\n    tflite_model = converter.convert()\n\n    with open(\"<model_float>.tflite\", \"wb\") as f:\n        f.write(tflite_model)\n\n\nIf this fails, stop. The issue is export/lowering, not INT8.\n\n## 2. 
Dynamic-range baseline\n\n\n    converter = tf.lite.TFLiteConverter.from_saved_model(\"<saved_model_dir>\")\n    converter.optimizations = [tf.lite.Optimize.DEFAULT]\n\n    tflite_model = converter.convert()\n\n    with open(\"<model_dynamic_range>.tflite\", \"wb\") as f:\n        f.write(tflite_model)\n\n\nIf dynamic-range quantization works while full INT8 fails, weights are probably not the main problem. The problem is activation quantization.\n\n## 3. Full INT8 baseline\n\n\n    converter = tf.lite.TFLiteConverter.from_saved_model(\"<saved_model_dir>\")\n    converter.optimizations = [tf.lite.Optimize.DEFAULT]\n    converter.representative_dataset = representative_dataset\n    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]\n\n    tflite_model = converter.convert()\n\n    with open(\"<model_int8>.tflite\", \"wb\") as f:\n        f.write(tflite_model)\n\n\nBe careful with:\n\n\n    converter.inference_input_type = tf.int8\n    converter.inference_output_type = tf.int8\n\n\nFor text models, token IDs are categorical integer indices, not numeric image/audio activations. `input_ids` and masks often remain `int32`. Do not blindly force token IDs to INT8.\n\nAlways inspect the converted model:\n\n\n    interpreter = tf.lite.Interpreter(model_path=\"<model_int8>.tflite\")\n    interpreter.allocate_tensors()\n\n    print(\"Inputs:\")\n    for item in interpreter.get_input_details():\n        print(item[\"name\"], item[\"dtype\"], item[\"shape\"], item[\"quantization\"])\n\n    print(\"Outputs:\")\n    for item in interpreter.get_output_details():\n        print(item[\"name\"], item[\"dtype\"], item[\"shape\"], item[\"quantization\"])\n\n\nIf the final logits are INT8, the host decoder must respect the output tensor’s scale and zero point.\n\nFor greedy argmax, quantized argmax is often equivalent if all logits share one scale and zero point. 
For beam search, length penalty, temperature, top-k, or probability arithmetic, dequantization or careful fixed-point handling is safer.\n\n* * *\n\n# About `inference_output_type=tf.float32`\n\nThis line is suspicious:\n\n\n    converter.inference_output_type = tf.float32\n\n\nIt is not necessarily the root cause of the collapse, but it is worth testing without it.\n\nIf the target is a strict INT8 hardware delegate, leaving a float output can create an awkward quantize/dequantize boundary or a partially non-integer interface. That may be acceptable for debugging, but it is not ideal for a strict integer deployment.\n\nHowever, the repeated-token collapse is more likely caused by an internal activation/cross-attention quantization problem than by the output type alone.\n\nI would test both:\n\n\n    # Debug-friendly interface\n    converter.inference_output_type = tf.float32\n\n\nand:\n\n\n    # Strict integer numeric output, if compatible with your graph interface\n    converter.inference_output_type = tf.int8\n\n\nThen compare:\n\n\n    full INT8 CPU output\n    full INT8 delegate output\n    first-step source sensitivity\n    BLEU\n\n\n* * *\n\n# Why 16x8 is useful even though it is not deployable\n\nThe experimental mode:\n\n\n    tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8\n\n\nis useful diagnostically because it tests whether INT8 activations are the problem.\n\nIf 16x8 improves quality but the runtime rejects `TILE`, the interpretation is:\n\n\n    The model likely needs more activation precision.\n    The target delegate cannot execute the more accurate path.\n\n\nLiteRT documents 16-bit activations with 8-bit weights as an option that can help when activations are sensitive, but optimized kernel/delegate support is more limited than ordinary INT8. See:\n\n  * LiteRT 16x8 post-training integer quantization\n\n\n\nSo the `TILE` problem is not surprising. 
It is a runtime/delegate support failure, not proof that plain INT8 PTQ should work.\n\n* * *\n\n# The first diagnostic test I would run\n\nBefore doing more BLEU evaluation, run a first-step source-sensitivity test.\n\nPick two very different source sentences:\n\n\n    source A: \"The committee approved the budget after three hours of debate.\"\n    source B: \"The patient developed a fever after the second injection.\"\n\n\nUse the same decoder prefix:\n\n\n    decoder_input_ids = [decoder_start_token_id]\n\n\nCompare:\n\n\n    FP32(source A, BOS) → logits_A_fp32\n    FP32(source B, BOS) → logits_B_fp32\n\n    INT8(source A, BOS) → logits_A_int8\n    INT8(source B, BOS) → logits_B_int8\n\n\nHealthy behavior:\n\n\n    FP32 logits differ by source.\n    INT8 logits also differ by source.\n\n\nBroken source-blind behavior:\n\n\n    FP32 logits differ by source.\n    INT8 logits are nearly identical across sources.\n\n\nExample helper:\n\n\n    import numpy as np\n\n    def topk_ids(logits, k=10):\n        flat = np.asarray(logits).reshape(-1)\n        return np.argsort(flat)[-k:][::-1]\n\n    def compare_logits(logits_a, logits_b, k=10):\n        a = np.asarray(logits_a).reshape(-1).astype(np.float64)\n        b = np.asarray(logits_b).reshape(-1).astype(np.float64)\n\n        top_a = topk_ids(a, k)\n        top_b = topk_ids(b, k)\n\n        cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)\n\n        return {\n            \"argmax_a\": int(top_a[0]),\n            \"argmax_b\": int(top_b[0]),\n            \"same_argmax\": bool(top_a[0] == top_b[0]),\n            \"topk_overlap\": len(set(top_a.tolist()) & set(top_b.tolist())),\n            \"cosine\": float(cosine),\n            \"range_a\": float(a.max() - a.min()),\n            \"range_b\": float(b.max() - b.min()),\n            \"top_a\": top_a.tolist(),\n            \"top_b\": top_b.tolist(),\n        }\n\n\nThis is more informative than BLEU.\n\nBLEU tells you the model is broken. 
First-step source sensitivity tells you whether the encoder context is already gone at the first decoder step.\n\n* * *\n\n# Use Quantization Debugger\n\nUse TFLite’s Quantization Debugger to identify where error first explodes:\n\n  * LiteRT Quantization Debugger guide\n  * TensorFlow tf.lite.experimental.QuantizationDebugger\n\n\n\nInspect these tensors first:\n\n\n    encoder output\n    decoder layer 0 cross-attention query\n    decoder layer 0 cross-attention key\n    decoder layer 0 cross-attention value\n    decoder layer 0 cross-attention scores\n    decoder layer 0 cross-attention output\n    post-cross-attention residual\n    final decoder hidden state\n    LM-head logits\n\n\nLook for:\n\nObservation | Likely meaning\n---|---\nEncoder hidden states saturated | Encoder output quantization is bad\nCross-attention K/V nearly constant | Source memory is destroyed\nAttention scores nearly constant | Decoder cannot select source positions\nAttention scores extreme | Softmax collapses\nCross-attention output near zero | Source signal is muted\nResidual dominates attention output | Encoder signal is drowned\nLogits almost identical across sources | Decoder is source-blind\nLogits saturated | Final projection/output scale problem\n\nSelective quantization can also be useful diagnostically. 
For example, leave one region float and see whether BLEU recovers:\n\n\n    leave encoder output float\n    leave cross-attention K/V projections float\n    leave attention score path float\n    leave post-cross-attention residual float\n    leave LM head float\n\n\nThis may not be deployable on an INT8-only delegate, but it can identify the tensor group that kills the model.\n\n* * *\n\n# Full diagnostic ladder\n\nRun the same evaluation set through these variants:\n\nVariant | Purpose | Interpretation\n---|---|---\nOriginal FP32 TensorFlow / Transformers | Reference | Should reproduce BLEU around `23.9`\nFloat TFLite CPU | Export/lowering check | If bad, quantization is not the first problem\nDynamic-range TFLite CPU | Weight-quantization check | If good, weights are not the main issue\nFull INT8 TFLite CPU | Quantization check | If bad, calibration/numerics are failing\nFull INT8 TFLite delegate | Runtime check | If CPU good but delegate bad, runtime/delegate is failing\n16x8 TFLite CPU, if possible | Activation-precision check | If better, INT8 activations are the bottleneck\n\nThe key split is:\n\n\n    float TFLite bad\n    → export/lowering/fused-graph issue\n\n    float TFLite good, INT8 CPU bad\n    → quantization/calibration issue\n\n    INT8 CPU good, INT8 delegate bad\n    → delegate/operator/kernel issue\n\n    16x8 better than INT8\n    → activation precision issue\n\n\n* * *\n\n# What to do if split PTQ still fails\n\nIf the split encoder/decoder-step model still collapses after proper calibration, the realistic options are:\n\n## 1. 
Quantization-aware training\n\nUse QAT if PTQ cannot meet the accuracy target.\n\nRelevant docs:\n\n  * TensorFlow Model Optimization: quantization-aware training\n  * TensorFlow post-training quantization overview\n\n\n\nImportant: do QAT on the deployment-shaped graph, not only on the original training graph.\n\nThat means:\n\n\n    same max source length\n    same decoder-step shape\n    same masks\n    same BOS/EOS behavior\n    same tokenizer\n    same target delegate constraints\n    same quantized encoder/decoder boundary\n\n\n## 2. Distillation into a quantization-friendly model\n\nIf the original architecture is too sensitive, distill into a smaller model designed for the target constraints:\n\n\n    fixed source length\n    fixed decoder-step shape\n    simpler attention pattern\n    no fused generation graph\n    delegate-supported ops only\n    QAT or PTQ-aware evaluation from the beginning\n\n\n## 3. Runtime change, if possible\n\nIf the target can change, use a Transformer-native runtime instead of generic fused TFLite.\n\nUseful references:\n\n  * CTranslate2\n  * CTranslate2 Transformers guide\n  * CTranslate2 quantization\n  * ONNX Runtime quantization\n\n\n\nCTranslate2 supports many encoder-decoder Transformer families and quantization modes. Even if it cannot be shipped on the final target, it is useful as a sanity check:\n\n\n    If CTranslate2 INT8 works but TFLite INT8 collapses,\n    the model is probably quantizable,\n    but the current TFLite path is not preserving it.\n\n\n## 4. 
Requirement change\n\nIf the hardware delegate truly requires plain full INT8 TFLite and the model cannot survive that path, the requirement may be incompatible with the model family.\n\nPossible requirement changes:\n\n\n    allow int16 activations\n    allow selected float fallback\n    allow a custom op\n    allow a different runtime\n    allow a smaller/distilled model\n    allow server-side inference\n\n\n* * *\n\n# What I would not spend time on\n\n## Blindly adding more calibration samples\n\nMore data does not fix the wrong calibration distribution.\n\nBad:\n\n\n    1000 source examples × BOS only\n\n\nBetter:\n\n\n    200 source examples × multiple decoder prefixes\n\n\n## Blindly trying converter flags\n\nConverter flags are secondary. The primary issue is graph shape and activation calibration.\n\n## Assuming `TILE` is the root cause of BLEU collapse\n\n`TILE` explains why the 16x8 path is not viable on the target. It does not by itself explain why full INT8 repeats tokens. These are related deployment constraints, but not the same failure.\n\n## Assuming the converter understands generation semantics\n\nThe converter lowers tensors and ops. It does not know that a certain tensor is “encoder memory that must preserve source conditioning.”\n\n* * *\n\n# My final recommendation\n\nI would proceed like this:\n\n  1. **Do not keep the fused graph as the main production candidate.**\n  2. Build a **float TFLite baseline** and verify it matches the original model.\n  3. Build a **dynamic-range TFLite baseline**.\n  4. Split into:\n     * `encoder.tflite`\n     * `decoder_step.tflite`\n  5. Calibrate the decoder-step model using **real decoder prefixes across multiple timesteps**.\n  6. Run **full INT8 CPU** before using the hardware delegate.\n  7. Run the **first-step source-sensitivity test**.\n  8. Use **Quantization Debugger** around encoder output and decoder cross-attention.\n  9. 
Explicitly handle the **quantized encoder-output / decoder-input boundary**.\n  10. If PTQ still collapses, move to **QAT or distillation**.\n  11. If CPU INT8 works but the delegate fails, treat it as a **delegate/operator support problem**.\n\n\n\nThe concise diagnosis is:\n\n> The converted model is probably not failing because TFLite cannot tokenize, decode, or run the graph at all. It is failing because full INT8 static PTQ has destroyed the numerical path that carries encoder information into decoder cross-attention. The decoder still emits tokens, but it no longer receives useful source context, so it falls back to repeated high-prior tokens and BLEU collapses.\n\n* * *\n\n# Short answer\n\n  * **Yes**, this is a known full-INT8 PTQ failure class.\n  * **No**, there is probably not one converter flag that fixes a fused encoder-decoder generation graph.\n  * **The likely broken region is cross-attention or the encoder-hidden-state boundary.**\n  * **The recommended deployment shape is split encoder + decoder-step, with generation outside TFLite.**\n  * **Calibration must include real decoder prefixes across timesteps, not only source inputs.**\n  * **Use float TFLite, dynamic-range TFLite, full INT8 CPU, and full INT8 delegate as separate baselines.**\n  * **Use first-step source-sensitivity tests and Quantization Debugger before relying only on BLEU.**\n  * **If careful split PTQ still fails, use QAT, distillation, or a different runtime/precision target.**\n\n",
  "title": "PTQ INT8 via TFLiteConverter — encoder-decoder seq2seq model loses encoder context entirely after conversion"
}