{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreicpgngtfihnlgtsjgxynkh5bpqyosq6x47reqnsg6ttnekc5cmqfe",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlktmnn67e22"
},
"path": "/t/gemma-4-e4b-latency-optimisations/175910#post_1",
"publishedAt": "2026-05-11T06:39:07.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "Working on a banking assistant pipeline using Gemma 4 for ASR + normalization + intent extraction + entity extraction + QnA in a single flow. Current latency is around ~6s on an NVIDIA L4 and ~1.5s on an H200 for end-to-end inference.\n\nPipeline includes:\n\n * ASR\n\n * text normalization\n\n * fuzzy/phonetic name correction\n\n * single-pass intent + entity extraction\n\n * async FastAPI serving\n\n\n\n\nI’m trying to reduce latency further, maybe less than 500 ms.\n\nQuestions:\n\n 1. What are the best optimizations for Gemma 4 inference in production?\n\n 2. Would vLLM/TensorRT-LLM/Flash Attention significantly help for this workload?\n\n 3. Any recommendations around batching, quantization, KV cache, or async pipeline improvements?\n\n 4. Has anyone optimized small structured-output workloads like this on L4 specifically?\n\n\n\n\nWould love suggestions from people deploying Gemma/Qwen/Llama models in real-time systems.",
"title": "Gemma 4 e4b latency optimisations"
}