{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreiepfvi6eb2smghmebpewatgm62du5aewcrds2gls2a2wwpidzv54a",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlu2k7kvkab2"
  },
  "path": "/t/gemma-4-e4b-latency-optimisations/175910#post_2",
  "publishedAt": "2026-05-14T22:29:31.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "FlashAttention issue: support head_dim=512 for Gemma 4 global attention layers",
    "Transformers issue: per-layer FlashAttention for Gemma 4",
    "vLLM issue: Gemma 4 E4B extremely slow because vLLM forces TRITON_ATTN",
    "vLLM Gemma 4 spec decode code/docs noting TRITON_ATTN because of heterogeneous head dimensions",
    "NVIDIA L4 product page",
    "NVIDIA H200 product page",
    "GPU comparison: H200 vs L4",
    "Gemma 4 model card",
    "Gemma 4 audio guide",
    "Gemma 4 prompt formatting guide",
    "vLLM Gemma 4 Usage Guide",
    "vLLM Gemma 4 multimodal implementation docs",
    "NVIDIA Riva ASR overview",
    "NVIDIA Riva ASR performance docs",
    "Whisper-Streaming",
    "vLLM prefix caching design",
    "vLLM benchmark CLI",
    "vLLM structured-output benchmark script",
    "vLLM prefix-caching benchmark script",
    "vLLM Structured Outputs",
    "SGLang Structured Outputs",
    "Outlines docs",
    "llguidance",
    "XGrammar paper",
    "Generating Structured Outputs from Language Models",
    "vLLM issue: reasoning-parser gemma4 can bypass structured output with enable_thinking=false",
    "Ollama issue: think=false breaks structured output for Gemma 4",
    "RapidFuzz docs",
    "RapidFuzz GitHub",
    "vLLM attention backends",
    "vLLM prefix caching",
    "vLLM CUDA Graphs",
    "vLLM torch.compile integration",
    "vLLM MTP docs",
    "SGLang project",
    "SGLang structured outputs",
    "SGLang attention backends",
    "SGLang RadixAttention concept",
    "SGLang paper",
    "SGLang issue: Gemma 4 E4B FP8 KV cache crash",
    "TensorRT-LLM overview",
    "NVIDIA TensorRT-LLM product page",
    "TensorRT-LLM supported models",
    "FlashAttention GitHub",
    "FlashAttention-2 page",
    "FlashAttention issue: Gemma 4 head_dim=512",
    "KIVI KV-cache quantization paper",
    "SGLang Gemma 4 E4B FP8 KV issue",
    "Google Gemma MTP docs",
    "Gemma 4 E4B assistant checkpoint",
    "Speculative decoding paper",
    "Medusa paper"
  ],
  "textContent": "Gemma 4 is an excellent model, but it isn’t well-suited for GPUs with older architectures. (This is generally true for models starting with Gemma 3.)\n\nWhen it comes to latency, the issues can be broadly divided into two categories: those that can be resolved by changing the backend, and those that require a rethinking of the pipeline itself:\n\n* * *\n\n# Gemma 4 E4B latency optimisation notes for a banking assistant pipeline\n\nYou are probably looking at a **compound latency problem** , not one missing flag.\n\nCurrent setup:\n\n  * Model: `google/gemma-4-E4B-it` or similar\n  * Hardware:\n    * NVIDIA L4: ~6s end-to-end\n    * NVIDIA H200: ~1.5s end-to-end\n  * Pipeline:\n    * ASR\n    * text normalization\n    * fuzzy / phonetic name correction\n    * intent extraction\n    * entity extraction\n    * QnA\n    * async FastAPI serving\n  * Target: ideally <500 ms\n\n\n\nMy main conclusion:\n\n> <500 ms is realistic for the common banking-command path only if the pipeline is decomposed.\n>  <500 ms is unlikely for one all-in-one Gemma 4 E4B call that does audio → ASR → normalization → fuzzy matching → extraction → QnA on an L4.\n\nThe best path is not “just add FlashAttention” or “just use vLLM”. The best path is:\n\n\n    streaming ASR\n      -> deterministic normalization\n      -> external fuzzy / phonetic candidate lookup\n      -> fast intent/entity path\n      -> Gemma 4 E4B only for ambiguity, fallback, and QnA\n\n\n* * *\n\n## 1. Main likely causes\n\n### 1.1 Gemma 4 E4B has an attention-backend constraint\n\nGemma 4 is not just a normal small dense decoder from a serving point of view. The important detail is its mixed attention layout:\n\n\n    Sliding/local attention layers:\n      head_dim = 256\n\n    Global/full attention layers:\n      global_head_dim = 512\n\n\nThat matters because the usual FlashAttention-2 path supports head dimensions up to 256, while Gemma 4 global attention layers need 512. 
See:\n\n  * FlashAttention issue: support head_dim=512 for Gemma 4 global attention layers\n  * Transformers issue: per-layer FlashAttention for Gemma 4\n  * vLLM issue: Gemma 4 E4B extremely slow because vLLM forces TRITON_ATTN\n  * vLLM Gemma 4 spec decode code/docs noting TRITON_ATTN because of heterogeneous head dimensions\n\n\n\nThis is the key trap:\n\n\n    L4 can generally run FlashAttention-2.\n    But Gemma 4 E4B cannot be assumed to use FlashAttention-2 end-to-end,\n    because its global attention layers use global_head_dim=512.\n\n\nCheck logs for lines like:\n\n\n    Gemma4 model has heterogeneous head dimensions\n    Forcing TRITON_ATTN backend\n    Using AttentionBackendEnum.TRITON_ATTN\n    Using AttentionBackendEnum.FLASH_ATTN\n    Using AttentionBackendEnum.FLASHINFER\n\n\nIf your L4 run is forced onto `TRITON_ATTN`, that can explain a large part of the latency gap.\n\n* * *\n\n### 1.2 L4 vs H200 is a huge memory-bandwidth mismatch\n\nThe L4/H200 latency gap is plausible even before considering software. LLM inference, especially decode at small batch sizes, is often memory-bandwidth sensitive.\n\nRelevant hardware context:\n\n  * NVIDIA L4 product page\n  * NVIDIA H200 product page\n  * GPU comparison: H200 vs L4\n\n\n\nApproximate memory bandwidth:\n\n\n    L4:   ~300 GB/s\n    H200: ~4.8 TB/s\n\n    H200 / L4 bandwidth ratio:\n      4800 / 300 ≈ 16x\n\n\nSo a big L4/H200 gap does not necessarily mean your code is broken. It may mean:\n\n\n    lower memory bandwidth\n    + Gemma 4 attention fallback\n    + long prefill\n    + audio encoder cost\n    + structured-output overhead\n    = multi-second L4 latency\n\n\n* * *\n\n### 1.3 Audio inside Gemma is convenient, but probably not the lowest-latency ASR path\n\nGemma 4 E2B/E4B supports audio input. 
See:\n\n  * Gemma 4 model card\n  * Gemma 4 audio guide\n  * Gemma 4 prompt formatting guide\n\n\n\nThat is useful for prototyping and multimodal reasoning, but for a sub-500 ms banking assistant, I would not use Gemma as the default ASR engine.\n\nIn vLLM, Gemma 4’s multimodal path is not necessarily optimized the same way as the language-model path. The vLLM Gemma 4 guide and model implementation notes are worth reading:\n\n  * vLLM Gemma 4 Usage Guide\n  * vLLM Gemma 4 multimodal implementation docs\n\n\n\nFor low-latency voice systems, use a dedicated streaming ASR path where possible:\n\n  * NVIDIA Riva ASR overview\n  * NVIDIA Riva ASR performance docs\n  * Whisper-Streaming\n\n\n\nArchitecture-wise:\n\n\n    Bad for latency:\n      full audio -> Gemma 4 -> ASR + extraction + QnA\n\n    Better:\n      streaming ASR -> text normalization -> fuzzy lookup -> extraction/QnA\n\n\n* * *\n\n### 1.4 Your workload is probably prefill-bound, not decode-bound\n\nIntent/entity extraction usually emits a tiny JSON object:\n\n\n    {\n      \"intent\": \"transfer_money\",\n      \"amount_minor\": 500000,\n      \"currency\": \"JPY\",\n      \"recipient_candidate_id\": \"p_001\",\n      \"needs_confirmation\": true\n    }\n\n\nThat output may be only 30-80 tokens.\n\nFor short outputs, latency is often dominated by **prefill** : the model reading the prompt before producing the first token.\n\nYour prompt may include:\n\n  * banking policy\n  * intent definitions\n  * entity schema\n  * tool descriptions\n  * JSON schema\n  * examples\n  * normalization instructions\n  * fuzzy-name candidate lists\n  * retrieved QnA context\n\n\n\nIf that becomes 1K-4K+ tokens, your hot path is likely dominated by input processing, not generation.\n\nRelevant reading:\n\n  * vLLM prefix caching design\n  * vLLM benchmark CLI\n  * vLLM structured-output benchmark script\n  * vLLM prefix-caching benchmark script\n\n\n\nPrompt layout matters.\n\nGood layout:\n\n\n    fixed system prompt\n    
fixed banking rules\n    fixed schema instructions\n    fixed examples\n    variable transcript\n    variable recipient candidates\n    variable account context\n\n\nBad layout:\n\n\n    timestamp\n    request id\n    variable user data\n    fixed system prompt\n    fixed schema\n    fixed examples\n\n\nIf variable content is at the top, prefix caching is much less useful.\n\n* * *\n\n### 1.5 Structured output is necessary, but it has latency and correctness traps\n\nStructured output is the right choice for a banking assistant. But it is not free.\n\nUseful docs:\n\n  * vLLM Structured Outputs\n  * SGLang Structured Outputs\n  * Outlines docs\n  * llguidance\n  * XGrammar paper\n  * Generating Structured Outputs from Language Models\n\n\n\nFor Gemma 4 specifically, also watch for thinking/parser issues:\n\n  * vLLM issue: reasoning-parser gemma4 can bypass structured output with enable_thinking=false\n  * Ollama issue: think=false breaks structured output for Gemma 4\n\n\n\nImportant point:\n\n\n    A faster JSON result is not necessarily an optimized constrained-JSON result.\n    It may be unconstrained text that happens to look like JSON.\n\n\nFor banking, verify:\n\n\n    Does every output parse?\n    Are required fields impossible to omit?\n    Are invalid enum values impossible?\n    Are extra fields blocked?\n    Can the model output prose before JSON?\n    Can it invent recipient IDs not in the candidate list?\n\n\nA grammar can enforce JSON shape. Your application still needs to enforce banking semantics.\n\n* * *\n\n## 2. 
Best production optimizations\n\n### 2.1 Split the system into multiple paths\n\nRecommended architecture:\n\n\n    audio stream\n      -> VAD / endpointing\n      -> streaming ASR\n      -> transcript partials\n      -> deterministic normalization\n      -> fuzzy / phonetic candidate retrieval\n      -> fast intent/entity path\n          -> if high confidence:\n                policy validation + confirmation/tool call\n          -> if ambiguous:\n                Gemma 4 E4B short structured extraction\n          -> if open-ended:\n                Gemma 4 E4B QnA endpoint\n\n\nLatency targets:\n\nPath | Target | Notes\n---|---|---\nSimple command path | <500 ms | Realistic with streaming ASR + non-LLM preprocessing\nAmbiguous Gemma extraction | 500 ms-2 s | More realistic on L4; faster on H200\nFull audio → Gemma → extraction → QnA | <500 ms | Unlikely on L4\nQnA | streaming | Optimize TTFT, not full completion latency\n\nCommon banking commands are usually limited enough for a fast path:\n\n  * balance inquiry\n  * recent transactions\n  * transfer money\n  * card lock/unlock\n  * bill payment\n  * recipient lookup\n  * human handoff\n  * branch/ATM/product/policy QnA\n\n\n\nDo not use the full generative path for every deterministic command.\n\n* * *\n\n### 2.2 Move text normalization to code\n\nNormalization should mostly be deterministic.\n\nExamples:\n\n\n    \"five thousand yen\"       -> 5000 JPY\n    \"tomorrow morning\"        -> normalized date/time\n    \"one two three four\"      -> account number fragment\n    \"oh\" vs \"zero\"            -> digit correction\n    full-width / half-width   -> normalized Japanese text\n    kana / romaji variants    -> canonical search forms\n\n\nThis is faster and more auditable than relying on the LLM.\n\nFor banking, auditability matters. 
You want logs like:\n\n\n    {\n      \"surface\": \"five thousand yen\",\n      \"normalized_amount_minor\": 5000,\n      \"currency\": \"JPY\",\n      \"rule\": \"currency_parser_v3\"\n    }\n\n\n* * *\n\n### 2.3 Move fuzzy / phonetic name correction outside the LLM\n\nDo candidate generation outside the model:\n\n\n    ASR transcript span\n      -> text normalization\n      -> phonetic expansion\n      -> kana / romaji / kanji variants\n      -> edit distance / token similarity\n      -> account/contact/payee database lookup\n      -> top-k candidates\n\n\nUseful library:\n\n  * RapidFuzz docs\n  * RapidFuzz GitHub\n\n\n\nPass only the top candidates to the model:\n\n\n    {\n      \"heard_name\": \"sato ken\",\n      \"candidates\": [\n        {\n          \"candidate_id\": \"p_001\",\n          \"display_name\": \"佐藤 健\",\n          \"relationship\": \"recent_payee\",\n          \"score\": 0.94\n        },\n        {\n          \"candidate_id\": \"p_002\",\n          \"display_name\": \"斉藤 健\",\n          \"relationship\": \"saved_contact\",\n          \"score\": 0.78\n        }\n      ]\n    }\n\n\nThen Gemma decides:\n\n\n    Is the intent clear?\n    Is the recipient unambiguous?\n    Should the assistant ask for confirmation?\n\n\nDo not pass hundreds of names into the prompt.\n\n* * *\n\n### 2.4 Separate extraction and QnA endpoints\n\nUse different configs.\n\nExtraction endpoint:\n\n\n    input: text only\n    output: shallow JSON\n    max_tokens: 32-96\n    temperature: 0\n    max_model_len: 1024-2048 initially\n    thinking: off only if schema enforcement is verified\n    MTP: off initially\n    prefix cache: on in production\n    schema: shallow\n\n\nQnA endpoint:\n\n\n    input: text + compact retrieved/tool context\n    output: streamed natural language\n    max_tokens: 128-512+\n    temperature: low\n    MTP: test on/off\n    thinking: optional\n    structured output: off unless tool call needed\n\n\nReason:\n\n\n    Extraction is often 
prefill/schema-bound.\n    QnA is more decode-bound.\n\n\n* * *\n\n### 2.5 Keep the extraction schema shallow\n\nGood hot-path schema:\n\n\n    {\n      \"intent\": \"transfer_money\",\n      \"amount_minor\": 500000,\n      \"currency\": \"JPY\",\n      \"recipient_candidate_id\": \"p_001\",\n      \"needs_confirmation\": true\n    }\n\n\nAvoid hot-path schemas like:\n\n\n    {\n      \"normalization_trace\": [],\n      \"policy_analysis\": {},\n      \"candidate_ranking_explanation\": \"\",\n      \"tool_plan\": [],\n      \"assistant_response\": \"\",\n      \"debug_reasoning\": \"\"\n    }\n\n\nFor hot extraction, use:\n\n  * `intent`\n  * `entities`\n  * `needs_confirmation`\n  * `candidate_id`\n  * `confidence` or `ambiguity_reason`\n  * `fallback_code`\n\n\n\nAvoid:\n\n  * long explanations\n  * model reasoning traces\n  * policy analysis\n  * candidate ranking explanation\n  * natural-language answer in the same extraction output\n\n\n\n* * *\n\n## 3. vLLM vs SGLang vs TensorRT-LLM vs FlashAttention\n\n### 3.1 vLLM\n\nUse vLLM as the first baseline.\n\nRelevant docs:\n\n  * vLLM Gemma 4 Usage Guide\n  * vLLM attention backends\n  * vLLM prefix caching\n  * vLLM CUDA Graphs\n  * vLLM torch.compile integration\n  * vLLM MTP docs\n\n\n\nBaseline command for text-only extraction testing:\n\n\n    vllm serve google/gemma-4-E4B-it \\\n      --max-model-len 2048 \\\n      --gpu-memory-utilization 0.90 \\\n      --limit-mm-per-prompt '{\"image\": 0, \"audio\": 0}'\n\n\nWhy text-only first?\n\n\n    Because you need to know whether the model path itself is fast\n    before adding audio, schema, fuzzy matching, and FastAPI orchestration.\n\n\nCaveat: vLLM may force `TRITON_ATTN` for Gemma 4 E4B because of mixed head dimensions. 
If so, vLLM may be stable but not as fast as you expect.\n\n* * *\n\n### 3.2 SGLang\n\nSGLang is worth testing for short structured extraction.\n\nRelevant docs:\n\n  * SGLang project\n  * SGLang structured outputs\n  * SGLang attention backends\n  * SGLang RadixAttention concept\n  * SGLang paper\n\n\n\nWhere SGLang may help:\n\n\n    short JSON extraction\n    stable repeated prompt prefixes\n    agentic / multi-step language programs\n    structured-output-heavy workloads\n\n\nCaveat:\n\n  * SGLang issue: Gemma 4 E4B FP8 KV cache crash\n\n\n\nSo start conservatively:\n\n\n    python -m sglang.launch_server \\\n      --model-path google/gemma-4-E4B-it \\\n      --mem-fraction-static 0.90\n\n\nAvoid FP8 KV initially. Start with BF16/auto KV.\n\nMy recommendation:\n\n\n    vLLM baseline first.\n    SGLang A/B test for text-only structured extraction.\n    Start SGLang with BF16/auto KV, not FP8 KV.\n\n\n* * *\n\n### 3.3 TensorRT-LLM\n\nTensorRT-LLM is worth testing, especially on H200, but not as the first fix.\n\nRelevant links:\n\n  * TensorRT-LLM overview\n  * NVIDIA TensorRT-LLM product page\n  * TensorRT-LLM supported models\n\n\n\nTensorRT-LLM is most attractive when:\n\n\n    hardware is H100/H200-class\n    deployment is NVIDIA-native\n    workload is stable\n    shapes are controlled\n    quantization path is validated\n    structured-output requirements are supported\n\n\nBefore committing, validate:\n\n\n    Gemma 4 E4B exact checkpoint\n    audio path\n    guided decoding / structured output\n    MTP / speculative decoding\n    KV-cache reuse\n    quantization format\n    L4 behavior\n    H200 behavior\n    p50/p95/p99 latency\n\n\n* * *\n\n### 3.4 FlashAttention\n\nFlashAttention is not a simple fix here.\n\nRelevant links:\n\n  * FlashAttention GitHub\n  * FlashAttention-2 page\n  * FlashAttention issue: Gemma 4 head_dim=512\n\n\n\nAccurate summary:\n\n\n    L4 can generally run FlashAttention-2.\n    Gemma 4 E4B cannot be assumed to use 
FlashAttention-2 end-to-end,\n    because Gemma 4 global attention layers use global_head_dim=512.\n\n\nDo not force FlashAttention unless your exact engine version validates that it supports Gemma 4’s mixed layout.\n\n* * *\n\n## 4. Attention backend guidance\n\nFor Gemma 4 E4B, always log the actual backend.\n\n### On L4\n\nBackend | View\n---|---\nauto | Best first baseline\n`TRITON_ATTN` | Likely safe fallback, possibly slower\n`FLASH_ATTN` | Do not assume valid because global head dim is 512\n`FLASHINFER` | Test only if engine accepts it\nFA3 | Not the L4 answer\nSDPA | Debug/correctness fallback\n\n### On H200\n\nBackend | View\n---|---\nauto | Best first baseline\nadvanced Hopper paths | Worth testing if engine supports Gemma 4\n`TRITON_ATTN` | Safe fallback\n`FLASH_ATTN` | Still blocked if 512 global dim unsupported\n`FLASHINFER` | Worth testing only if accepted\nSDPA | Debug/correctness fallback\n\nMain rule:\n\n\n    Do not choose a backend from generic benchmarks.\n    Choose based on what your engine actually uses for Gemma 4 E4B.\n\n\n* * *\n\n## 5. Batching recommendations\n\nFor real-time banking, do not optimize only for throughput. Optimize p95/p99 latency.\n\nUse microbatching:\n\n\n    small max_num_seqs\n    small max_num_batched_tokens\n    minimal queue delay\n    short max_tokens\n    short prompt\n    prefix caching\n\n\nBenchmark:\n\n\n    concurrency: 1, 2, 4, 8, 16\n    prompt tokens: 256, 512, 1024, 2048\n    output tokens: 16, 32, 64, 128\n    schema: off, shallow, production\n    audio: off, on\n\n\nExtraction endpoint:\n\n\n    max_tokens: 32-96\n    temperature: 0\n    top_p: 1\n    schema: shallow\n    prompt: compact\n\n\nQnA endpoint:\n\n\n    max_tokens: 128-512\n    streaming: on\n    MTP: test\n    retrieved context: capped\n\n\n* * *\n\n## 6. 
Quantization and KV cache\n\nStart with BF16 / auto KV as the correctness baseline.\n\nThen test separately:\n\n\n    BF16 weights + BF16/auto KV\n    quantized weights + BF16/auto KV\n    BF16 weights + FP8 KV\n    quantized weights + FP8 KV\n\n\nDo not assume quantization improves latency. Sometimes it improves memory footprint but hurts latency if the kernel or dequantization path is poor.\n\nFor KV cache:\n\n  * use prefix caching for stable prompts\n  * avoid FP8 KV as the first SGLang Gemma 4 E4B test\n  * validate entity accuracy and JSON correctness after quantization\n\n\n\nRelevant links:\n\n  * vLLM prefix caching\n  * KIVI KV-cache quantization paper\n  * SGLang Gemma 4 E4B FP8 KV issue\n\n\n\n* * *\n\n## 7. MTP / speculative decoding\n\nGemma 4 supports MTP-style acceleration.\n\nRelevant links:\n\n  * Google Gemma MTP docs\n  * vLLM MTP docs\n  * Gemma 4 E4B assistant checkpoint\n  * Speculative decoding paper\n  * Medusa paper\n\n\n\nBut MTP mostly helps decode-heavy workloads.\n\nFor this pipeline:\n\nWorkload | MTP usefulness\n---|---\n30-80 token JSON extraction | probably limited\nshort answer, 100-200 tokens | worth testing\nlonger QnA, 256-512+ tokens | more likely useful\naudio preprocessing | no direct help\nprompt prefill | no direct help\nfuzzy name correction | no help\n\nUse MTP for QnA experiments first, not the tiny extraction JSON path.\n\nExample:\n\n\n    vllm serve google/gemma-4-E4B-it \\\n      --max-model-len 4096 \\\n      --gpu-memory-utilization 0.90 \\\n      --limit-mm-per-prompt '{\"image\": 0, \"audio\": 0}' \\\n      --speculative-config '{\"method\":\"mtp\",\"model\":\"google/gemma-4-E4B-it-assistant\",\"num_speculative_tokens\":4}'\n\n\nVerify the exact syntax against your installed vLLM version.\n\n* * *\n\n## 8. 
FastAPI / async pipeline improvements\n\nFastAPI is probably not the main cause of 6s latency, but it can become visible after model latency drops.\n\nAvoid:\n\n\n    loading tokenizer/model/client per request\n    building huge schemas per request\n    CPU-bound fuzzy matching in the event loop\n    audio resampling in the event loop\n    blocking HTTP calls inside async handlers\n    unbounded request queues\n    serial execution of independent steps\n\n\nUse:\n\n\n    persistent model client\n    connection pooling\n    uvloop\n    orjson\n    bounded queues\n    timeouts\n    request cancellation\n    process pool for CPU-bound fuzzy matching\n    streaming responses\n    early ASR partials\n    parallel normalization and candidate lookup\n\n\nInstrument stages:\n\n\n    request_received\n    audio_upload_done\n    asr_start\n    asr_partial\n    asr_final\n    normalization_start\n    normalization_end\n    fuzzy_lookup_start\n    fuzzy_lookup_end\n    llm_request_start\n    llm_request_end\n    tool_validation_start\n    tool_validation_end\n    response_sent\n\n\nSeparate:\n\n\n    model latency\n    pipeline latency\n    queueing latency\n    network latency\n    CPU preprocessing latency\n\n\n* * *\n\n## 9. 
Benchmark plan\n\n### Phase 1: text-only LLM baseline\n\nRun Gemma 4 E4B without audio and without structured output.\n\n\n    vllm serve google/gemma-4-E4B-it \\\n      --max-model-len 2048 \\\n      --gpu-memory-utilization 0.90 \\\n      --limit-mm-per-prompt '{\"image\": 0, \"audio\": 0}'\n\n\nTest:\n\nPrompt tokens | Output tokens\n---|---\n128 | 32\n512 | 32\n1024 | 32\n2048 | 32\n\nRecord:\n\n\n    attention backend\n    TTFT\n    TPOT / ITL\n    total latency\n    GPU utilization\n    GPU memory\n    p50 / p95 / p99\n\n\nIf this is already slow on L4, focus on model/backend/hardware.\n\n* * *\n\n### Phase 2: structured-output overhead\n\nRun the same text-only prompt in three modes:\n\n\n    free text\n    JSON instruction only\n    constrained JSON schema\n\n\nMeasure:\n\n\n    latency delta\n    JSON parse rate\n    schema validity\n    invalid enum prevention\n    required field enforcement\n\n\n* * *\n\n### Phase 3: real banking extraction prompt\n\nAdd:\n\n\n    intent definitions\n    entity definitions\n    confirmation rules\n    recipient candidates\n    small policy summary\n\n\nSweep prompt sizes:\n\n\n    256\n    512\n    1024\n    2048\n    4096\n\n\nIf TTFT grows sharply, you are prefill-bound.\n\n* * *\n\n### Phase 4: audio breakdown\n\nMeasure audio separately:\n\n\n    audio read/upload\n    decode/resample\n    feature extraction\n    audio encoder\n    transcription\n    normalization\n    LLM extraction\n    structured output\n\n\nDo not diagnose only from end-to-end audio latency.\n\n* * *\n\n### Phase 5: vLLM vs SGLang\n\nCompare only the text structured extraction path first:\n\n\n    same model\n    same prompt\n    same schema\n    same max_tokens\n    same GPU\n    same concurrency\n    same KV dtype\n    same warmup\n\n\nTest:\n\n\n    vLLM BF16/auto KV\n    SGLang BF16/auto KV\n    vLLM structured output\n    SGLang structured output\n    vLLM prefix caching\n    SGLang prefix-reuse behavior\n\n\n* * *\n\n### Phase 6: QnA with 
MTP\n\nAfter extraction is understood, test:\n\n\n    QnA max_tokens=128\n    QnA max_tokens=256\n    QnA max_tokens=512\n\n\nCompare:\n\n\n    MTP off\n    MTP on\n\n\nMTP is more likely to help here than in tiny JSON extraction.\n\n* * *\n\n## 10. Concrete next-step checklist\n\n  1. Run text-only Gemma 4 E4B with no schema and no audio.\n  2. Check attention backend logs.\n  3. Measure TTFT and TPOT separately.\n  4. Add shallow JSON schema and measure overhead.\n  5. Add real banking prompt and run prompt-length sweep.\n  6. Measure audio separately.\n  7. Move normalization and fuzzy matching outside Gemma.\n  8. Use prefix caching with stable prompt layout.\n  9. A/B test vLLM vs SGLang for text-only structured extraction.\n  10. Test MTP only for QnA-length outputs.\n  11. Report p50/p95/p99, not only average latency.\n  12. Treat sub-500 ms as the target for the simple command path, not the all-in-one Gemma path.\n\n\n\n* * *\n\n## Final practical answer\n\nFor this specific setup:\n\n\n    Production default:\n      vLLM first\n\n    Structured-extraction challenger:\n      SGLang\n\n    H200 optimization candidate:\n      TensorRT-LLM\n\n    Attention backend:\n      auto first, log actual backend, expect TRITON_ATTN risk\n\n    FlashAttention:\n      not a simple fix because Gemma 4 global layers use head_dim=512\n\n    ASR:\n      move out of Gemma for the hot path\n\n    Normalization:\n      move to deterministic code\n\n    Fuzzy name correction:\n      move to candidate-generation service\n\n    Gemma 4 E4B:\n      use for ambiguity, fallback extraction, and QnA\n\n    500 ms:\n      realistic for decomposed simple command path\n      unlikely for all-in-one audio -> Gemma -> extraction -> QnA on L4\n\n\nShort version:\n\n  * The current all-in-one Gemma 4 E4B flow is unlikely to hit <500 ms on L4.\n  * The most important model-specific issue is Gemma 4’s mixed attention layout: `head_dim=256` local layers and `global_head_dim=512` global layers.\n  
* vLLM may force `TRITON_ATTN`, which can explain poor L4 latency.\n  * FlashAttention is not a simple fix because of Gemma 4’s 512-dimensional global attention heads.\n  * H200 is much faster largely because it has far more memory bandwidth and better high-end inference headroom.\n  * Move ASR, normalization, and fuzzy name correction outside Gemma.\n  * Use Gemma 4 E4B for ambiguity, fallback extraction, and QnA.\n  * Use short prompts, shallow schemas, prefix caching, and separate extraction/QnA endpoints.\n  * Benchmark vLLM first, SGLang for structured extraction, and TensorRT-LLM for H200 production experiments.\n\n",
  "title": "Gemma 4 e4b latency optimisations"
}