Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreids7n2uzt7u44zuj54kxgpgzqlkw4xxp525mtca3w6widihl76vta",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlxfwrse2fx2"
  },
  "path": "/t/ptq-int8-via-tfliteconverter-encoder-decoder-seq2seq-model-loses-encoder-context-entirely-after-conversion/175595#post_4",
  "publishedAt": "2026-05-16T07:32:23.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "LiteRT / TFLite post-training integer quantization",
    "LiteRT post-training quantization overview",
    "LiteRT Quantization Debugger",
    "LiteRT 8-bit quantization specification",
    "Optimum ONNX encoder-decoder export guide",
    "Optimum TFLite export guide",
    "ONNX Runtime quantization docs",
    "TensorFlow Model Optimization QAT guide",
    "TensorFlow Model Optimization comprehensive QAT guide",
    "CTranslate2",
    "CTranslate2 quantization docs",
    "Optimum ONNX export guide",
    "Optimum ONNX export functions",
    "LiteRT post-training quantization",
    "LiteRT post-training integer quantization",
    "TensorFlow tf.lite.experimental.QuantizationDebugger",
    "TensorFlow comprehensive QAT guide",
    "TensorFlow Model Optimization QuantizeConfig",
    "TensorFlow Model Optimization: QAT support for MultiHeadAttention",
    "TFLite quantized MultiHeadAttention issue",
    "TFLite Micro quantized Softmax zero-point issue",
    "LiteRT Torch LayerNorm full-INT8 issue",
    "LiteRT 16x8 post-training integer quantization",
    "CTranslate2 GitHub",
    "CTranslate2 Transformers guide",
    "CTranslate2 quantization"
  ],
  "textContent": "I can’t find a single real-world example of this working “as-is” through a search…\n\n* * *\n\n# Are there any real solutions for full-INT8 TFLite seq2seq Transformer deployment?\n\nShort answer: **yes, but not as a simple`TFLiteConverter` flag**.\n\nFor a Hugging Face-style encoder-decoder Transformer such as T5, MarianMT, BART, mBART, Pegasus, M2M100, or NLLB, the realistic solution is not:\n\n\n    converter.optimizations = [tf.lite.Optimize.DEFAULT]\n    converter.representative_dataset = representative_dataset\n    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]\n\n\nand done.\n\nThat path can produce a valid `.tflite` file while the decoder becomes numerically useless. The likely reason is that the decoder’s **cross-attention path** is not being calibrated correctly. The encoder can quantize cleanly, while the decoder loses source conditioning and starts producing repeated tokens, random tokens, empty strings, or nonsensical translations.\n\nThe most realistic path is:\n\n\n    encoder_int8.tflite\n    +\n    decoder_step_int8.tflite\n    +\n    host-side generation loop\n    +\n    explicit decoder calibration\n    +\n    explicit encoder→decoder quantized boundary handling\n    +\n    Quantization Debugger\n    +\n    possibly decoder-step QAT or a custom attention delegate\n\n\nThere is probably no public turnkey recipe for this exact target today.\n\n* * *\n\n## Current state of the field\n\nI would summarize the situation like this:\n\n> Full-INT8 TFLite deployment of a Hugging Face-style encoder-decoder Transformer decoder is not a mature public path. There are good public resources for TFLite INT8 in general, good public resources for ONNX/CTranslate2 seq2seq deployment, and good research on Transformer quantization. 
But I could not find a validated public example of a T5/MarianMT/BART-style `encoder.tflite + decoder_step.tflite` full-INT8 deployment with working decoder cross-attention and custom delegate execution.\n\nUseful references:\n\n  * LiteRT / TFLite post-training integer quantization\n  * LiteRT post-training quantization overview\n  * LiteRT Quantization Debugger\n  * LiteRT 8-bit quantization specification\n  * Optimum ONNX encoder-decoder export guide\n  * Optimum TFLite export guide\n  * ONNX Runtime quantization docs\n  * TensorFlow Model Optimization QAT guide\n  * TensorFlow Model Optimization comprehensive QAT guide\n  * CTranslate2\n  * CTranslate2 quantization docs\n\n\n\n* * *\n\n## Why the fused PTQ path fails\n\nA fused seq2seq graph hides too much.\n\nA seq2seq Transformer naturally runs like this:\n\n\n    1. Run encoder once.\n    2. Repeatedly run decoder for each generated token.\n    3. Select the next token outside the model.\n    4. Stop on EOS or max length.\n\n\nThe decoder uses the source through cross-attention:\n\n\n    decoder hidden state → Q\n    encoder hidden state → K, V\n\n    attention_scores = Q @ K.T\n    attention_probs = softmax(attention_scores + mask)\n    context = attention_probs @ V\n\n\nIf INT8 quantization corrupts this path, the decoder can still emit tokens because it still has:\n\n  * decoder token embeddings,\n  * decoder self-attention,\n  * learned language-model priors,\n  * LM-head bias,\n  * forced BOS/language-token priors.\n\n\n\nBut it no longer receives useful source information. That produces:\n\n\n    same-ish output for unrelated sources\n    repeated tokens\n    empty strings\n    random tokens\n    nonsensical translations\n    BLEU collapse\n\n\nThat is not ordinary quantization loss. 
That is source-conditioning failure.\n\n* * *\n\n## Why encoder-only calibration is insufficient\n\nTFLite full integer quantization depends on representative data to calibrate activation ranges.\n\nFor a decoder, representative data must cover the decoder state distribution, not only source inputs.\n\nBad calibration:\n\n\n    def representative_dataset():\n        for source in sources:\n            yield {\n                \"input_ids\": source[\"input_ids\"],\n                \"attention_mask\": source[\"attention_mask\"],\n            }\n\n\nThat mostly calibrates the encoder path.\n\nA decoder needs calibration samples like:\n\n\n    decoder_input_ids\n    decoder_attention_mask\n    encoder_hidden_states\n    encoder_attention_mask\n\n\nand those samples must represent real generation states:\n\n\n    BOS-only prefix\n    early target prefix\n    middle target prefix\n    near-EOS prefix\n    short source\n    long source\n    padding-heavy source\n    near-no-padding source\n    names / numbers / rare tokens\n    domain-specific examples\n\n\nA better calibration strategy is:\n\n\n    200 source examples × 5 decoder prefixes\n\n\nnot:\n\n\n    1000 source examples × encoder only\n\n\nThe issue is not just dataset size. It is whether the decoder cross-attention tensors are ever exercised with realistic activation ranges.\n\n* * *\n\n# Solution 1: Split the graph\n\nThe first serious solution is to stop trying to deploy the fused graph.\n\nDo not make this the production target:\n\n\n    fused_seq2seq_int8.tflite\n\n\nUse this instead:\n\n\n    encoder_int8.tflite\n    decoder_step_int8.tflite\n    host_generation_loop\n\n\nThis matches the architecture used by mature seq2seq export flows. 
Hugging Face Optimum’s ONNX path explicitly handles encoder-decoder generation by separating encoder and decoder behavior, including decoder past-key-value reuse for autoregressive generation:\n\n  * Optimum ONNX export guide\n  * Optimum ONNX export functions\n\n\n\nTarget layout:\n\n\n    encoder_int8.tflite\n\n    inputs:\n      input_ids: int32\n      attention_mask: int32\n\n    outputs:\n      encoder_hidden_states: int8\n\n\n\n    decoder_step_int8.tflite\n\n    inputs:\n      decoder_input_ids: int32\n      decoder_attention_mask: int32\n      encoder_hidden_states: int8\n      encoder_attention_mask: int32\n\n    outputs:\n      logits: int8\n\n\nHost-side generation:\n\n\n    encoder_states = run_encoder(input_ids, attention_mask)\n\n    decoder_ids = [decoder_start_token_id]\n\n    for step in range(max_new_tokens):\n        logits = run_decoder_step(\n            decoder_input_ids=decoder_ids,\n            decoder_attention_mask=make_decoder_mask(decoder_ids),\n            encoder_hidden_states=encoder_states,\n            encoder_attention_mask=attention_mask,\n        )\n\n        next_id = select_next_token(logits)\n        decoder_ids.append(next_id)\n\n        if next_id == eos_token_id:\n            break\n\n\nThis does not automatically fix quantization, but it makes the problem debuggable.\n\n* * *\n\n# Solution 2: Build decoder-specific representative data\n\nThe decoder representative dataset must feed the decoder signature directly.\n\nConceptual decoder calibration:\n\n\n    def representative_decoder_dataset():\n        for src_text, tgt_text in calibration_pairs:\n            encoder_inputs = tokenize_source(src_text)\n\n            # For debugging:\n            #   Use FP32 encoder states.\n            #\n            # For deployment fidelity:\n            #   Use quantized encoder states plus the real encoder→decoder requantization bridge.\n            encoder_hidden_states = run_encoder_for_calibration(encoder_inputs)\n\n            
target_ids = tokenize_target(tgt_text)\n\n            for prefix_len in [1, 2, 4, 8, 16, 32]:\n                if prefix_len > len(target_ids):\n                    continue\n\n                prefix = target_ids[:prefix_len]\n                prefix = pad_to_static_length(prefix, DECODER_LEN)\n\n                yield {\n                    \"decoder_input_ids\": prefix.astype(\"int32\"),\n                    \"decoder_attention_mask\": make_decoder_mask(prefix).astype(\"int32\"),\n                    \"encoder_hidden_states\": encoder_hidden_states,\n                    \"encoder_attention_mask\": encoder_inputs[\"attention_mask\"].astype(\"int32\"),\n                }\n\n\nIf your SavedModel has multiple signatures, the representative dataset can conceptually be split by signature:\n\n\n    def representative_dataset():\n        for batch in encoder_calibration_batches:\n            yield (\n                \"encode\",\n                {\n                    \"input_ids\": batch[\"input_ids\"],\n                    \"attention_mask\": batch[\"attention_mask\"],\n                },\n            )\n\n        for batch in decoder_calibration_batches:\n            yield (\n                \"decode\",\n                {\n                    \"decoder_input_ids\": batch[\"decoder_input_ids\"],\n                    \"decoder_attention_mask\": batch[\"decoder_attention_mask\"],\n                    \"encoder_hidden_states\": batch[\"encoder_hidden_states\"],\n                    \"encoder_attention_mask\": batch[\"encoder_attention_mask\"],\n                },\n            )\n\n\nRelevant docs:\n\n  * LiteRT post-training quantization\n  * LiteRT post-training integer quantization\n\n\n\nThe key idea:\n\n> The decoder must be calibrated as a decoder, not as a side effect of encoder input calibration.\n\n* * *\n\n# Solution 3: Handle the encoder→decoder quantized boundary\n\nIf the encoder and decoder are separate TFLite models, the boundary can break the model even if 
both models are individually valid.\n\nThe encoder output and decoder input may have different quantization parameters:\n\n\n    encoder output:\n      scale_e\n      zero_point_e\n\n    decoder encoder_hidden_states input:\n      scale_d\n      zero_point_d\n\n\nYou cannot blindly pass raw `int8` bytes from the encoder output into the decoder input unless the quantization parameters match.\n\nIf they differ, requantize:\n\n\n    real_value = scale_e * (q_e - zero_point_e)\n    q_d = round(real_value / scale_d + zero_point_d)\n    q_d = clamp(q_d, -128, 127)\n\n\nDeployment-style boundary test matrix:\n\nEncoder | Boundary | Decoder | Meaning\n---|---|---|---\nFP32 | float | FP32 | Split-graph reference\nINT8 | dequantized float | FP32 | Tests encoder quality\nFP32 | quantized to decoder input | INT8 | Tests decoder quality\nINT8 | requantized | INT8 | Full deployment-like path\n\nIf this boundary is wrong, the symptom can look exactly like broken cross-attention:\n\n\n    decoder runs\n    but receives meaningless encoder memory\n\n\n* * *\n\n# Solution 4: Move cross-attention K/V projection to the encoder side\n\nThis is an architecture-level workaround.\n\nNormally, each decoder layer computes K/V from encoder hidden states:\n\n\n    K_i = W_k_i(encoder_hidden_states)\n    V_i = W_v_i(encoder_hidden_states)\n\n\nInstead, make the encoder-side artifact produce precomputed cross-attention memory:\n\n\n    encoder_int8.tflite\n\n    outputs:\n      cross_k_layer_0\n      cross_v_layer_0\n      cross_k_layer_1\n      cross_v_layer_1\n      ...\n\n\nThen make the decoder consume those tensors directly:\n\n\n    decoder_step_int8.tflite\n\n    inputs:\n      decoder_input_ids\n      decoder_attention_mask\n      cross_k_layer_0\n      cross_v_layer_0\n      cross_k_layer_1\n      cross_v_layer_1\n      ...\n\n\nWhy this can help:\n\n  * K/V are computed once, not every decoder step.\n  * K/V become explicit graph outputs/inputs.\n  * You can inspect their quantization 
parameters directly.\n  * You can design the encoder→decoder boundary around K/V instead of generic hidden states.\n  * The decoder graph becomes more predictable.\n\n\n\nTradeoff:\n\n\n    num_decoder_layers × 2 tensors\n\n\nYou get more interface complexity, but also much more control.\n\nThis is one of the most promising workarounds if the failure is specifically cross-attention K/V scale mismatch.\n\n* * *\n\n# Solution 5: Use first-step source-sensitivity testing\n\nBefore relying on BLEU, test whether the decoder still sees the source.\n\nUse two unrelated inputs:\n\n\n    source A: The committee approved the budget after a long debate.\n    source B: The patient developed a fever after the second injection.\n    decoder prefix: decoder_start_token_id\n\n\nCompare first-step logits:\n\n\n    FP32(source A, BOS) vs FP32(source B, BOS)\n    INT8(source A, BOS) vs INT8(source B, BOS)\n\n\nHealthy behavior:\n\n\n    FP32 logits differ across sources.\n    INT8 logits also differ across sources.\n\n\nBroken behavior:\n\n\n    FP32 logits differ across sources.\n    INT8 logits are nearly identical across sources.\n\n\nMinimal helper:\n\n\n    import numpy as np\n\n    def topk_ids(logits, k=10):\n        flat = np.asarray(logits).reshape(-1)\n        return np.argsort(flat)[-k:][::-1]\n\n    def compare_logits(logits_a, logits_b, k=10):\n        a = np.asarray(logits_a).reshape(-1).astype(np.float64)\n        b = np.asarray(logits_b).reshape(-1).astype(np.float64)\n\n        top_a = topk_ids(a, k)\n        top_b = topk_ids(b, k)\n\n        cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)\n\n        return {\n            \"argmax_a\": int(top_a[0]),\n            \"argmax_b\": int(top_b[0]),\n            \"same_argmax\": bool(top_a[0] == top_b[0]),\n            \"topk_overlap\": len(set(top_a.tolist()) & set(top_b.tolist())),\n            \"cosine\": float(cosine),\n            \"range_a\": float(a.max() - a.min()),\n            \"range_b\": 
float(b.max() - b.min()),\n            \"top_a\": top_a.tolist(),\n            \"top_b\": top_b.tolist(),\n        }\n\n\nThis test is more diagnostic than BLEU.\n\nBLEU tells you output quality is bad. First-step source sensitivity tells you whether the decoder lost encoder context immediately.\n\n* * *\n\n# Solution 6: Use Quantization Debugger and selective rescue\n\nUse the official Quantization Debugger to locate the first catastrophic tensor.\n\nRelevant docs:\n\n  * LiteRT Quantization Debugger\n  * TensorFlow tf.lite.experimental.QuantizationDebugger\n\n\n\nStart with decoder layer 0:\n\n\n    decoder embedding output\n    decoder self-attention output\n    cross-attention Q\n    cross-attention K\n    cross-attention V\n    QK^T attention scores\n    attention probabilities\n    attention_probs @ V context\n    cross-attention output projection\n    post-cross-attention residual\n    LM-head logits\n\n\nInterpretation table:\n\nObservation | Likely cause\n---|---\nK/V nearly constant | Encoder memory destroyed\nQ/K scales incompatible | Dot product corrupted\nAttention scores flat | Source selection lost\nAttention scores extreme | Softmax collapse\nContext vector near zero | Cross-attention muted\nResidual dominates context | Source signal drowned\nLogits same across source inputs | Decoder source-blind\nLogits saturated | Output scale problem\n\nSelective quantization is useful diagnostically:\n\n\n    leave K/V projections float\n    leave QK^T score path float\n    leave Softmax path float\n    leave cross-attention output projection float\n    leave post-cross-attention residual float\n    leave LM head float\n\n\nIf leaving a region float restores BLEU, that region is the failure point.\n\nThis may not be deployable on a strict INT8 delegate, but it tells you what must be fixed.\n\n* * *\n\n# Solution 7: Decoder-step QAT\n\nIf explicit decoder PTQ still fails, QAT is the next real TFLite-native option.\n\nRelevant docs:\n\n  * TensorFlow Model 
Optimization QAT guide\n  * TensorFlow comprehensive QAT guide\n  * TensorFlow Model Optimization QuantizeConfig\n\n\n\nDo not begin with the fused generation graph.\n\nBegin with:\n\n\n    decoder_step_qat_model\n\n\nInputs:\n\n\n    decoder_input_ids\n    decoder_attention_mask\n    encoder_hidden_states\n    encoder_attention_mask\n\n\nTarget:\n\n\n    next target token\n\n\nTraining objective:\n\n\n    teacher-forced next-token prediction\n\n\nPrefix sampling:\n\n\n    BOS\n    BOS + token 1\n    BOS + tokens 1..3\n    middle prefix\n    near-EOS prefix\n\n\nThe QAT graph must match deployment:\n\n\n    same source length\n    same decoder prefix length\n    same masks\n    same decoder_start_token_id behavior\n    same encoder_hidden_states boundary\n    same logits output convention\n    same supported operator set\n\n\nImportant caveat:\n\n> Transformer attention QAT in TensorFlow/TFLite is not necessarily turnkey.\n\nThere are public issues around QAT support for `MultiHeadAttention`, which is a warning that you may need a custom Keras decoder-step implementation, custom `QuantizeConfig`, or manual fake-quant insertion.\n\nRelevant issue:\n\n  * TensorFlow Model Optimization: QAT support for MultiHeadAttention\n\n\n\nPossible implementation routes:\n\n\n    custom decoder-step Keras model\n    custom QuantizeConfig\n    manual FakeQuant insertion\n    rewrite attention into quantizable primitives\n    train a smaller deployment-specific decoder\n\n\n* * *\n\n# Solution 8: Custom delegate or custom op for quantized cross-attention\n\nIf you own the hardware delegate, the most robust engineering solution may be to stop relying on generic TFLite decomposition for attention.\n\nImplement quantized cross-attention as a delegate-supported fused subgraph or custom op.\n\nA real quantized cross-attention implementation needs to control:\n\n\n    Q projection scale\n    K projection scale\n    V projection scale\n    QK^T accumulation scale\n    mask 
representation\n    Softmax approximation range\n    attention_probs scale\n    attention_probs @ V accumulation\n    context output scale\n    output projection scale\n    residual merge scale\n\n\nThis is much harder than “support INT8 matmul.”\n\nAttention contains:\n\n\n    FULLY_CONNECTED\n    RESHAPE\n    TRANSPOSE\n    BATCH_MATMUL\n    ADD / mask\n    SOFTMAX\n    BATCH_MATMUL\n    FULLY_CONNECTED\n    ADD / residual\n    possibly LayerNorm-adjacent behavior\n\n\nRelevant public warnings:\n\n  * TFLite quantized MultiHeadAttention issue\n  * TFLite Micro quantized Softmax zero-point issue\n  * LiteRT Torch LayerNorm full-INT8 issue\n\n\n\nIf a hardware vendor says “we support INT8 matrix multiplication,” that is not enough. Cross-attention requires correct scale propagation through the whole attention block.\n\n* * *\n\n# Solution 9: Allow a precision exception if possible\n\nIf product constraints can change, the most natural accuracy fix is:\n\n\n    INT8 weights\n    +\n    INT16 or float activations for attention-sensitive paths\n\n\nLiteRT documents a 16x8 mode:\n\n  * LiteRT 16x8 post-training integer quantization\n\n\n\nThis can help when activations are sensitive to quantization, but runtime/delegate support is often limited.\n\nIf 16x8 improves quality but fails due to `TILE` or another unsupported op, the diagnostic meaning is still useful:\n\n\n    The model probably needs more activation precision.\n    The current delegate cannot execute the more accurate path.\n\n\nPossible compromise:\n\n\n    INT8 encoder\n    INT8 FFN/projections\n    INT16 or float cross-attention score path\n    INT8 output projection\n\n\nThis is not pure full-INT8, but it is often closer to what Transformer quantization actually needs.\n\n* * *\n\n# Solution 10: Distill or redesign the model for the target\n\nIf full-INT8 TFLite is absolutely mandatory and QAT/custom delegate work is too expensive, the best product path may be to change the model.\n\nOptions:\n\n\n    
smaller encoder-decoder Transformer\n    fewer decoder layers\n    smaller hidden size\n    shorter max source length\n    fixed decoder-step window\n    reduced vocabulary\n    domain-specific translation model\n    non-autoregressive model if task allows\n    RNN/Conv seq2seq model if task allows\n\n\nTrain with deployment constraints from the beginning:\n\n\n    static shapes\n    teacher-forced decoder-step training\n    QAT during fine-tuning\n    delegate-supported ops only\n    fixed source length\n    fixed decoder step shape\n\n\nThis is less elegant, but often more robust than trying to force a general-purpose pretrained Transformer decoder into a strict embedded INT8 delegate.\n\n* * *\n\n# Solution 11: Change runtime if allowed\n\nIf TFLite is negotiable, use a Transformer-native runtime.\n\n## CTranslate2\n\nCTranslate2 supports many encoder-decoder Transformer families and multiple quantization modes.\n\nUseful links:\n\n  * CTranslate2 GitHub\n  * CTranslate2 Transformers guide\n  * CTranslate2 quantization\n\n\n\nThis is the easiest way to answer:\n\n\n    Can this model family be quantized usefully at all?\n\n\nIf CTranslate2 INT8 works while TFLite INT8 fails, then the model is not inherently unquantizable. The TFLite path is the issue.\n\n## ONNX Runtime\n\nONNX Runtime has a more mature Transformer quantization story than TFLite for many workloads.\n\nUseful links:\n\n  * ONNX Runtime quantization docs\n  * Optimum ONNX export guide\n  * Optimum ONNX export functions\n\n\n\nImportant caveat:\n\n> ONNX Runtime success does not prove full-INT8 TFLite will work.\n\nONNX Runtime docs generally recommend dynamic quantization for Transformer-based models, while your target requires static full-INT8 behavior. 
Those are different deployment regimes.\n\n* * *\n\n# Recommended execution plan\n\nIf TFLite is mandatory, I would do this in order.\n\n## Step 1: Build split FP32 TFLite\n\nCreate:\n\n\n    encoder_fp32.tflite\n    decoder_step_fp32.tflite\n\n\nVerify:\n\n\n    split FP32 output ≈ original Transformers output\n\n\nDo not quantize until this works.\n\n* * *\n\n## Step 2: Quantize encoder only\n\nCreate:\n\n\n    encoder_int8.tflite\n    decoder_step_fp32.tflite\n\n\nIf quality remains good, the encoder is not the blocker.\n\n* * *\n\n## Step 3: Quantize decoder with decoder-specific calibration\n\nCreate:\n\n\n    decoder_step_int8.tflite\n\n\nUse representative samples with:\n\n\n    decoder_input_ids\n    decoder_attention_mask\n    encoder_hidden_states\n    encoder_attention_mask\n\n\nTest:\n\n\n    FP32 encoder + INT8 decoder\n    INT8 encoder + INT8 decoder\n\n\n* * *\n\n## Step 4: Test source sensitivity\n\nCompare first-step logits for two unrelated source sentences.\n\nIf INT8 logits are nearly identical, the decoder is source-blind.\n\n* * *\n\n## Step 5: Debug cross-attention tensors\n\nUse Quantization Debugger around:\n\n\n    Q\n    K\n    V\n    QK^T\n    Softmax\n    context\n    residual\n    LM head\n\n\nFind the first catastrophic divergence.\n\n* * *\n\n## Step 6: Apply one targeted rescue\n\nFailure location | Targeted fix\n---|---\nEncoder output boundary | Explicit requantization bridge\nK/V projections | Move K/V projection to encoder side\nQK score path | Custom scale handling or higher precision\nSoftmax | Custom op/delegate or precision exception\nResidual merge | QAT or scale control\nLM head | Better calibration or QAT\n\n* * *\n\n## Step 7: Try decoder-step QAT\n\nUse teacher-forced target prefixes.\n\nDo not start with the fused model.\n\n* * *\n\n## Step 8: Validate CPU INT8 before delegate\n\nIf CPU INT8 fails, the model is still quantization-broken.\n\nIf CPU INT8 works and delegate fails, the problem is delegate support.\n\n* * 
*\n\n# What not to do\n\nDo not keep iterating on fused PTQ as the main path.\n\nDo not add only more encoder-side calibration data.\n\nDo not assume `inference_output_type=tf.float32` is the root cause.\n\nDo not assume ONNX/CTranslate2 success transfers directly to TFLite.\n\nDo not attempt QAT on the high-level fused Hugging Face model first.\n\nDo not debug the custom delegate until CPU INT8 is correct.\n\n* * *\n\n# Practical answer\n\nIf the question is:\n\n> Is there a solution?\n\nThe honest answer is:\n\n> **Yes, in principle. But not as a turnkey `TFLiteConverter` PTQ workflow.**\n\nThe most plausible TFLite-native solution is:\n\n\n    1. split encoder and decoder_step\n    2. calibrate decoder_step explicitly with real decoder prefixes\n    3. handle encoder→decoder requantization\n    4. use first-step source-sensitivity tests\n    5. use Quantization Debugger around cross-attention\n    6. use decoder-step QAT if PTQ fails\n    7. add custom delegate support only after CPU INT8 works\n\n\nThe most plausible non-TFLite solution is:\n\n\n    CTranslate2 or ONNX Runtime\n\n\nThe most robust product solution, if TFLite full INT8 is mandatory and QAT still fails, is:\n\n\n    distill or redesign the model for the delegate\n\n\n* * *\n\n# Short summary\n\n  * There is probably **no simple converter flag** that fixes this.\n  * Fused full-INT8 PTQ is probably a dead end for this model class.\n  * The first real solution is `encoder.tflite + decoder_step.tflite`.\n  * The decoder needs representative calibration with real decoder prefixes.\n  * The encoder→decoder quantized boundary must be handled explicitly.\n  * Cross-attention K/V may need to move to the encoder side.\n  * Use Quantization Debugger to locate the first bad tensor.\n  * Decoder-step QAT is the next realistic TFLite-native path.\n  * A custom attention delegate may be required for strict embedded INT8.\n  * If runtime constraints can change, CTranslate2 or ONNX Runtime is far more mature.\n  
* If constraints cannot change, distillation/redesign may be the most reliable product path.\n\n",
  "title": "PTQ INT8 via TFLiteConverter — encoder-decoder seq2seq model loses encoder context entirely after conversion"
}