Gemma 4 e4b latency optimisations
Hugging Face Forums [Unofficial]
May 11, 2026
I'm working on a banking assistant pipeline that uses Gemma 4 for ASR, normalization, intent extraction, entity extraction, and QnA in a single flow. End-to-end inference latency is currently about 6 s on an NVIDIA L4 and about 1.5 s on an H200.
Pipeline includes:
* ASR
* text normalization
* fuzzy/phonetic name correction
* single-pass intent + entity extraction
* async FastAPI serving
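To illustrate the kind of name-correction step listed above, here is a minimal sketch using only Python's standard-library `difflib`. It is not the actual implementation: the `KNOWN_NAMES` list, the `correct_name` helper, and the 0.6 cutoff are all hypothetical, and it does plain string similarity only, with no phonetic matching.

```python
import difflib

# Hypothetical list of known payee names; a real system would load these per customer.
KNOWN_NAMES = ["Priya Sharma", "Rahul Verma", "Anita Desai", "John Mathew"]


def correct_name(heard: str, known=KNOWN_NAMES, cutoff: float = 0.6) -> str:
    """Return the closest known name to an ASR-transcribed name.

    Falls back to the input unchanged when nothing is similar enough.
    """
    # Compare case-insensitively, but return the canonical spelling.
    lowered = {name.lower(): name for name in known}
    matches = difflib.get_close_matches(
        heard.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else heard


print(correct_name("priya sharme"))
```

A step like this runs on the CPU in well under a millisecond for short name lists, so it is unlikely to be where the latency goes.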
I’m trying to reduce latency further, ideally to under 500 ms.
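For anyone wanting to reason about where a 500 ms budget would go, per-stage timing could be sketched roughly as below. The stage functions (`fake_asr`, `fake_normalize`, `fake_extract`) are placeholder stubs, not the real pipeline, and the `timed` helper is a hypothetical name introduced only for this example.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, sink: dict):
    """Record the wall-clock time of a block, in milliseconds, under sink[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[stage] = (time.perf_counter() - start) * 1000.0


# Placeholder stages standing in for the real ASR / normalization / extraction calls.
def fake_asr(audio: bytes) -> str:
    return "transfer five hundred to priya sharme"


def fake_normalize(text: str) -> str:
    return text.replace("five hundred", "500")


def fake_extract(text: str) -> dict:
    return {"intent": "transfer", "text": text}


def run_pipeline(audio: bytes):
    """Run the stub pipeline and return (result, per-stage timings in ms)."""
    timings = {}
    with timed("asr", timings):
        text = fake_asr(audio)
    with timed("normalize", timings):
        text = fake_normalize(text)
    with timed("extract", timings):
        result = fake_extract(text)
    return result, timings


result, timings = run_pipeline(b"")
for stage, ms in timings.items():
    print(f"{stage}: {ms:.3f} ms")
```

With real stages in place of the stubs, a breakdown like this shows whether the model calls or the surrounding steps dominate the end-to-end number.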
Questions:
1. What are the best optimizations for Gemma 4 inference in production?
2. Would vLLM/TensorRT-LLM/Flash Attention significantly help for this workload?
3. Any recommendations around batching, quantization, KV cache, or async pipeline improvements?
4. Has anyone optimized small structured-output workloads like this on L4 specifically?
Would love suggestions from people deploying Gemma/Qwen/Llama models in real-time systems.