Gemma 4 e4b latency optimisations

Hugging Face Forums [Unofficial] May 11, 2026
I'm working on a banking assistant pipeline that uses Gemma 4 for ASR + normalization + intent extraction + entity extraction + QnA in a single flow. End-to-end inference latency is currently about 6 s on an NVIDIA L4 and about 1.5 s on an H200.

The pipeline includes:

* ASR
* text normalization
* fuzzy/phonetic name correction
* single-pass intent + entity extraction
* async FastAPI serving

I'm trying to reduce latency further, ideally to under 500 ms.

Questions:

1. What are the best optimizations for Gemma 4 inference in production?
2. Would vLLM, TensorRT-LLM, or Flash Attention significantly help for this workload?
3. Any recommendations around batching, quantization, KV cache, or async pipeline improvements?
4. Has anyone optimized small structured-output workloads like this on an L4 specifically?

I would love suggestions from people deploying Gemma/Qwen/Llama models in real-time systems.
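For anyone answering, a per-stage breakdown of the end-to-end number would probably help narrow things down. Below is a minimal sketch of how the stages listed above could be timed individually. Everything in it is an assumption for illustration: `StageTimer` is a hypothetical helper, and the stage functions (`asr`, `normalize`, `correct_names`, `extract`) are placeholders that just sleep, not the actual Gemma 4 calls.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Hypothetical helper: accumulates wall-clock time per named stage."""

    def __init__(self):
        self.timings_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.timings_ms[name] = self.timings_ms.get(name, 0.0) + elapsed_ms

    def report(self):
        """Return (stage, ms, percent of total) tuples, slowest stage first."""
        total = sum(self.timings_ms.values())
        ranked = sorted(self.timings_ms.items(), key=lambda kv: kv[1], reverse=True)
        return [(name, ms, ms / total * 100.0 if total else 0.0) for name, ms in ranked]


# Placeholder stages (assumptions): each sleeps to stand in for real work.
def asr(audio):
    time.sleep(0.05)
    return "pay jon smith 50 dollars"


def normalize(text):
    time.sleep(0.005)
    return text.lower()


def correct_names(text):
    time.sleep(0.005)
    return text.replace("jon", "john")


def extract(text):
    time.sleep(0.005)
    return {"intent": "transfer", "entities": {"payee": "john smith"}}


def run_pipeline(timer, audio):
    with timer.stage("asr"):
        text = asr(audio)
    with timer.stage("normalization"):
        text = normalize(text)
    with timer.stage("name_correction"):
        text = correct_names(text)
    with timer.stage("intent_entity_extraction"):
        result = extract(text)
    return result


if __name__ == "__main__":
    timer = StageTimer()
    run_pipeline(timer, audio=b"")
    for name, ms, pct in timer.report():
        print(f"{name:28s} {ms:8.1f} ms  {pct:5.1f}%")
```

Swapping the placeholders for the real calls would show whether the 6 s on the L4 is dominated by ASR, by token generation in the extraction pass, or by something outside the model, which in turn determines whether quantization, a different serving engine, or pipeline changes are worth trying first.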
