PTQ INT8 via TFLiteConverter — encoder-decoder seq2seq model loses encoder context entirely after conversion
I’m trying to deploy a seq2seq encoder-decoder model on an embedded target that only accepts INT8 TFLite models. The conversion via TFLiteConverter completes without errors, but the resulting model is completely broken at inference, which suggests the converter is not handling the encoder-decoder architecture correctly under full INT8 quantization.
Environment
tensorflow 2.13, transformers 4.40; macOS (conversion) → embedded Linux with INT8 hardware delegate (inference)
Problem
Converting a fused encoder-decoder seq2seq model to INT8 using TFLiteConverter with the following setup:
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
Conversion completes without errors, but the model generates repeated tokens for any input (BLEU drops from 23.9 to 0.04). The decoder stops using encoder context entirely from the first inference step.
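To make the symptom concrete, here is a minimal arithmetic sketch of one way a per-tensor INT8 range could erase attention scores. It is a hypothesis, not a confirmed diagnosis of this model: it assumes the cross-attention logits share a quantized tensor with a large additive mask constant (-1e9 here), and the logit values are made up for illustration.

```python
def quantize_int8(values, rmin, rmax):
    """Affine per-tensor INT8 quantization for a calibrated range [rmin, rmax]."""
    scale = (rmax - rmin) / 255.0
    zero_point = round(-128 - rmin / scale)
    zero_point = max(-128, min(127, zero_point))
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


# Hypothetical cross-attention logits for four encoder positions.
logits = [2.5, -1.0, 0.3, 4.1]

# Case 1: the range is calibrated on the logits alone.
q_clean, scale_clean, _ = quantize_int8(logits, min(logits), max(logits))

# Case 2: the same tensor also has to represent an additive mask of -1e9,
# so the calibrated minimum is dominated by the mask constant.
q_masked, scale_masked, _ = quantize_int8(logits, -1e9, max(logits))

print("clean :", q_clean, "scale =", scale_clean)
print("masked:", q_masked, "scale =", scale_masked)
```

In the second case the scale is in the millions, so all four logits land on the same INT8 value and the softmax over encoder positions would become uniform. If something like this is happening, it would be consistent with a decoder that ignores encoder context, but I have not verified that this is the actual cause.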
EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8 is not viable because the TILE op is unsupported at runtime.
Question
Is this a known limitation of TFLiteConverter PTQ for encoder-decoder architectures? Is there a recommended calibration strategy or converter configuration for fused encoder-decoder graphs with cross-attention?
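To frame the calibration question, below is a rough sketch of the kind of generator I have in mind for a fused graph. It assumes the model takes two integer inputs, encoder token ids and decoder token ids, in that order; the constants, the `make_representative_dataset` helper, and the idea of sampling several decoder prefix lengths per sentence pair are all illustrative assumptions, not something taken from the TFLite documentation.

```python
import numpy as np

# Placeholder values, not the real model's configuration.
ENC_LEN = 64
DEC_LEN = 64
PAD_ID = 0


def make_representative_dataset(pairs, max_prefixes=4):
    """Build a calibration generator from (source_ids, target_ids) pairs.

    Each pair yields several samples with decoder prefixes of different
    lengths, so calibration sees early and late decoding steps.
    """
    def pad(ids, length):
        ids = list(ids)[:length]
        return np.array([ids + [PAD_ID] * (length - len(ids))], dtype=np.int32)

    def gen():
        for src, tgt in pairs:
            last = max(1, min(len(tgt), DEC_LEN))
            steps = np.linspace(1, last, num=max_prefixes, dtype=int)
            for t in sorted(set(steps.tolist())):
                # One list entry per model input, in input order (assumed).
                yield [pad(src, ENC_LEN), pad(tgt[:t], DEC_LEN)]

    return gen


# Toy pair standing in for real tokenized data.
pairs = [([5, 6, 7], [1, 8, 9, 10, 2])]
samples = list(make_representative_dataset(pairs)())
```

The returned callable would be assigned to `converter.representative_dataset`. Whether covering multiple decoding steps like this actually improves the calibrated ranges for cross-attention is exactly what I am unsure about.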
Open to any working approach to move forward.
Reproducible notebook available on request.