Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicpgngtfihnlgtsjgxynkh5bpqyosq6x47reqnsg6ttnekc5cmqfe",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlktmnn67e22"
  },
  "path": "/t/gemma-4-e4b-latency-optimisations/175910#post_1",
  "publishedAt": "2026-05-11T06:39:07.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Working on a banking assistant pipeline that uses Gemma 4 for ASR + normalization + intent extraction + entity extraction + QnA in a single flow. Current end-to-end inference latency is ~6 s on an NVIDIA L4 and ~1.5 s on an H200.\n\nPipeline includes:\n\n  * ASR\n\n  * text normalization\n\n  * fuzzy/phonetic name correction\n\n  * single-pass intent + entity extraction\n\n  * async FastAPI serving\n\nI’m trying to reduce latency further, ideally to under 500 ms.\n\nQuestions:\n\n  1. What are the best optimizations for Gemma 4 inference in production?\n\n  2. Would vLLM, TensorRT-LLM, or FlashAttention significantly help for this workload?\n\n  3. Any recommendations around batching, quantization, KV cache, or async pipeline improvements?\n\n  4. Has anyone optimized small structured-output workloads like this on an L4 specifically?\n\nWould love suggestions from people deploying Gemma/Qwen/Llama models in real-time systems.",
  "title": "Gemma 4 e4b latency optimisations"
}