Successfully Running Gemma4-26B On-Prem? Looking to Discuss Deployment Struggles & Stable Setups
There do seem to be some success stories. Since it’s quite complex, here are the details:
My read is that Gemma 4 26B-A4B is deployable, but not casually deployable.
There are now enough working examples to say “yes, people are getting it running.” But the pattern I see is also pretty clear: successful setups tend to pin the entire stack tightly instead of mixing arbitrary latest versions of CUDA, PyTorch, Transformers, vLLM, quantized weights, Docker images, and Kubernetes settings.
So I would not frame this as “Gemma 4 is broken.” I would frame it as:
Gemma 4 26B-A4B sits in a fast-moving compatibility zone: new Gemma4 model support, MoE serving, long-context KV-cache pressure, quantized weights, CUDA/Blackwell containers, vLLM support freshness, reasoning/tool parsers, and Transformers v5-era dependency movement.
That makes it very different from deploying an older dense text-only model.
Working examples / reference deployments
1. vLLM has an official Gemma 4 recipe
vLLM’s Gemma 4 guide explicitly covers google/gemma-4-26B-A4B-it.
Useful links:
- Gemma 4 Usage Guide - vLLM Recipes
- Google/gemma-4-26B-A4B-it - vLLM Recipes
The recipe describes Gemma 4 as a multimodal model family with support for structured thinking/reasoning, function calling with a custom tool-use protocol, and OpenAI-compatible serving through vLLM.
The model-specific recipe is especially useful because it names several deployment-relevant properties:
google/gemma-4-26B-A4B-it- MoE model
- 26B total parameters
- about 4B active parameters
- 128 experts
- top-8 routing
- text + image support
- thinking mode
- tool-use protocol
- vLLM 0.19.1+ expectation
- memory guidance around
--max-model-len,--gpu-memory-utilization, multimodal profiling, FP8 KV cache, and async scheduling
A minimal conceptual starting point looks like this:
vllm serve google/gemma-4-26B-A4B-it \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--dtype bfloat16 \
--data-parallel-size 1
I would treat that as a baseline sanity check , not as a final production command.
The important part is not the exact command. The important part is the shape:
- one model
- one known-good vLLM path
- short or moderate context first
- BF16 if memory allows
- no GGUF
- no runtime MXFP4
- no data parallelism at the beginning
- no tool parser or reasoning parser until plain chat works
2. Google has a GKE + vLLM guide for Gemma 4
Google’s GKE guide includes a dedicated vLLM manifest path for Gemma 4 26B-A4B-it:
- Serve Gemma open models using GPUs on GKE with vLLM
A very important detail: Google’s example limits the context window to 16K with:
--max-model-len=16384
The guide also says that if you want a larger context window, up to 128K, you need to adjust the manifest and node-pool configuration with more GPU capacity.
That is a useful signal. Even an official Kubernetes example does not start by maxing out the context window. It starts conservatively.
For on-prem Kubernetes, I would read that as:
- do not start at 128K
- do not assume “the model loads” means “the service can handle realistic KV cache”
- leave memory headroom
- treat
max-model-lenas an infrastructure decision, not just a model option
3. Red Hat has a Day-0 Gemma 4 + vLLM guide
Red Hat also has a practical guide:
- Run Gemma 4 with Red Hat AI on Day 0: A step-by-step guide
The most useful line from that guide is the memory reality check:
The 26B A4B variant requires approximately 49 GB of GPU memory for model weights alone, so a single 80 GB GPU is recommended.
That explains a lot of confusion around “A4B.”
Yes, A4B means only a subset of parameters is active per token. But it does not mean you can treat the model like a normal 4B model from a serving-memory perspective. The full MoE weights still matter, and the serving engine still needs room for KV cache, CUDA graphs, batching, fragmentation overhead, and runtime buffers.
So:
- “4B active” helps compute cost.
- It does not make this a normal 4B deployment.
- BF16 still wants serious memory.
- 80GB-class hardware is the conservative path.
4. There are on-prem-ish DGX Spark / GB10 examples
This Japanese write-up is useful because it is close to the kind of on-prem/Kubernetes pain people actually hit:
- Gemma 4 26B-A4B-it を DGX Spark(GB10)で vLLM で動かす
The useful part is not only “it worked.” The useful part is the operational lesson:
- memory limits need to be generous
max-model-lenshould not be too ambitious at first- single-GPU workloads and RollingUpdate can be a bad fit
- first startup is heavy
- local cache makes testing easier
- base environment consistency still matters
That is exactly the kind of evidence I would show someone struggling with an on-prem deployment. It supports the idea that the model can be served, but the runtime contract matters a lot.
5. There are DGX Spark / Blackwell / NVFP4-oriented examples
There are also reports around DGX Spark, Blackwell GPUs, NVFP4, and Gemma-specific vLLM images:
- DGX Spark で Gemma 4 26B-A4B / 31B Dense を vLLM で試す
- Gemma 4 Models - which vLLM version? Any PRs spotted? - NVIDIA Developer Forums
- Deploying Gemma 4 26B A4B on an RTX 5090
- Finishing What We Started: Gemma 4 NVFP4 on vLLM, Desktop Blackwell, WSL2
These are encouraging, but they also reinforce the same point: exact combinations matter.
For example, some reports rely on:
vllm/vllm-openai:gemma4vllm/vllm-openai:gemma4-cu130- nightly images
- patched Gemma4 loader files
- Blackwell-specific CUDA/container assumptions
- NVFP4 or AWQ model repos
- special MoE backend flags
So these examples are not “just pip install the latest vLLM and go.” They are better read as known-good or nearly-known-good stack recipes.
Why this is hard
The hard part is not only “large model serving.” It is several layers moving at once.
1. MoE is not the same as a small dense model
Gemma 4 26B-A4B is a MoE model. That changes the failure modes.
A dense model is conceptually simpler: all parameters are active on every forward pass. With MoE, each token is routed through a subset of experts, so you get lower active compute, but you also get:
- expert routing
- expert weight loading
- MoE-specific kernels
- MoE-specific quantization paths
- different parallelism concerns
- serving-engine support requirements
That is why “26B total / 4B active” is not equivalent to “just serve a 4B model.”
A useful mental model:
4B active helps compute.
26B total still affects weight memory and loader complexity.
MoE still affects kernels, routing, quantization, and parallelism.
2. “Server started” is not the same as “inference is stable”
There is a vLLM issue where Gemma 4 MoE 26B-A4B can load, capture CUDA graphs, and start the API server, but fail on the first inference request when using data parallelism:
- Gemma 4 MoE (26B-A4B) crashes with --data-parallel-size > 1
That distinction matters a lot.
For this model, I would validate in stages:
- model download works
- tokenizer/config load works
- weights load
- CUDA graph capture completes
- API server starts
/v1/modelsworks- first
/v1/chat/completionsworks - repeated requests work
- concurrent requests work
- longer context works
- streaming works
- tool calling works
- reasoning parser works
- structured output works
Those are not the same test.
3. Runtime MXFP4 has had Gemma 4 MoE-specific failures
There is a vLLM issue for runtime MXFP4 quantization:
- Gemma 4 MoE (26B-A4B) — runtime MXFP4 quantization crashes during weight loading
The short version: --quantization mxfp4 crashes during weight loading for the Gemma 4 MoE model, even with --data-parallel-size 1.
This is a good example of why “just quantize it” is not enough. Quantization support has to exist at the exact intersection of:
- model architecture
- MoE loader
- weight tensor layout
- quantization format
- kernel backend
- vLLM version
- CUDA / GPU generation
- model repo format
So I would avoid runtime MXFP4 as a first attempt.
4. AWQ / NVFP4 support can be tag-dependent
There is also an issue suggesting AWQ behavior is tag-sensitive:
- Gemma 4 26B-A4B AWQ INT4: latest v0.19.1 broken, latest gemma4 DockerHub tag works
This is exactly the kind of issue that makes “install latest stable” a weak strategy. Sometimes the model-specific image has the right patch; sometimes nightly has the patch but introduces a different breakage; sometimes the release tag is behind one model-specific PR.
So the safer approach is:
model repo + vLLM image + CUDA version + quant format + flags
as one unit.
Not:
latest model + latest vLLM + latest transformers + latest CUDA + random quant
5. GGUF is not a universal vLLM path
There are discussions about Gemma 4 GGUF not working cleanly with vLLM:
- Unsloth Gemma 4 26B-A4B-it GGUF discussion: vLLM failure
I would keep the runtime lanes separate:
GGUF -> llama.cpp / Ollama / LM Studio style runtimes
vLLM -> HF safetensors or vLLM-supported quantized repos
Transformers -> current model docs, tokenizer, processor, model class support
A model being on Hugging Face does not mean every file format is equally natural for every runtime.
6. Tool calling / reasoning / structured output are separate validation layers
vLLM’s Gemma 4 guide mentions tool-use protocol and reasoning support, but that does not mean every tool parser / reasoning parser / structured output path is equally mature.
Relevant links:
- Gemma 4 Usage Guide - vLLM Recipes
- Gemma4 parser / reasoning / tool-call related issues in vLLM
- Model seems to have issues in vLLM, characters duplication - Gemma 4 discussion
I would not test the model first with an agent framework, auto-tool-choice, streaming, structured output, and a quantized MoE backend all at once.
Plain chat first. Then one feature at a time.
7. Long context is a serving-memory problem, not just a model-card feature
vLLM exists partly because KV cache management is a major bottleneck in LLM serving.
Useful background:
- Efficient Memory Management for Large Language Model Serving with PagedAttention
The PagedAttention paper explains why high-throughput serving struggles when KV cache memory is huge, grows and shrinks dynamically, and is wasted through fragmentation or duplication.
That matters here because --max-model-len directly changes KV-cache pressure. A model supporting long context does not mean your deployment can serve long context at your target concurrency.
This is why Google’s GKE example starting at 16K is meaningful.
Comparison with Qwen3.5 / Transformers v5 migration pain
I think Gemma 4 is in the same broad ecosystem migration zone as Qwen3.5, but the failure modes are somewhat different.
Qwen3.5 exposes the Transformers v5 boundary more directly.
Useful examples:
- vLLM issue: Qwen3.5 Model Tokenizer Loading Failure
- Unsloth issue: Qwen3.5 requires transformers>=5, but current vLLM dependency requires <5
- Qwen3.5 HF discussion: Transformers does not recognize qwen3_5
- Unsloth Qwen3.5 Fine-tuning Guide
The vLLM Qwen3.5 issue is especially clear: the model config expects Transformers >=5.0, while the vLLM image still requires Transformers >=4.56.0,<5.0.0. That creates a direct architecture/tokenizer recognition failure.
Gemma 4 seems different. It has more official serving paths already, including vLLM recipes and cloud/Kubernetes examples. But the pain shifts toward:
- MoE serving
- quantization loaders
- model-specific vLLM tags
- CUDA/Blackwell compatibility
- KV cache sizing
- tool/reasoning parser behavior
- Kubernetes runtime contract
- container vs native install differences
So I would summarize the contrast like this:
Qwen3.5:
more obvious Transformers v5 architecture/tokenizer dependency pain.
Gemma 4:
more official serving paths, but more practical pain around vLLM MoE serving,
quantization, parser behavior, context length, and deployment runtime.
Why “latest everything” is risky
Transformers v5 and huggingface_hub v1.0 are also part of the background.
Useful links:
- Transformers v5 Migration Guide
- huggingface_hub v1.0 migration guide
- Hugging Face Hub v1.0 announcement
The Transformers v5 migration guide notes that v5 pins huggingface_hub>=1.0.0. The Hub v1.0 migration brings its own changes, including a Python floor and HTTP stack migration to httpx.
So a “simple” model support upgrade can pull on:
- Transformers version
huggingface_hubversion- Python version
- HTTP client behavior
- downstream libraries pinned to Transformers 4.x
- vLLM images with pinned dependency ranges
- training/fine-tuning libraries
- inference endpoint containers
- notebook or hosted runtime images
This is why the issue often feels larger than the model itself.
My recommended debugging / deployment path
I would reduce the problem aggressively.
Phase 1: plain known-good serving
Start with:
- Gemma-specific vLLM image or official recipe
- BF16 if memory allows
- 1 GPU first
- data-parallel-size=1
- max-model-len=4096 or 16384
- no GGUF
- no runtime MXFP4
- no tool calling
- no reasoning parser
- no structured output
- no agent framework
- no Kubernetes rollout complexity if possible
Validate:
1. container starts
2. model downloads
3. weights load
4. API server starts
5. /v1/models works
6. minimal /v1/chat/completions works
7. repeated minimal requests work
Phase 2: increase context
Only after plain serving works:
- raise max-model-len to 16K
- test memory pressure
- test repeated requests
- test concurrency
- watch KV cache behavior
Do not jump straight to 128K.
Phase 3: add throughput features
Then test:
- prefix caching
- chunked prefill
- async scheduling
- FP8 KV cache
- max-num-seqs
- batching behavior
Phase 4: add quantization
Then try one quantization path at a time:
- known working AWQ repo + matching vLLM image
- known working NVFP4 repo + matching vLLM image
- avoid runtime MXFP4 until the exact issue status is clear
Keep a matrix:
model repo:
vLLM image:
CUDA version:
driver:
GPU:
quant:
flags:
max-model-len:
works / fails:
failure stage:
Phase 5: add tools / reasoning
Only after plain chat and concurrency are stable:
- --reasoning-parser gemma4
- --enable-auto-tool-choice
- --tool-call-parser gemma4
- streaming
- structured output
- agent framework integration
Validate these separately. A working plain chat endpoint does not prove a working tool-calling endpoint.
What I would tell someone struggling with this
I would say:
Your difficulties look normal for this model/runtime boundary. Gemma 4 26B-A4B has real working deployment paths, but the stable examples usually pin the stack tightly and start conservatively. I would not debug this as one bug. I would split it into architecture support, vLLM version, quantization path, CUDA/GPU compatibility, context length, Kubernetes runtime behavior, and parser/tooling behavior.
And:
For a first stable run, avoid native installs, avoid GGUF-on-vLLM, avoid runtime MXFP4, avoid data parallelism, avoid long context, and avoid tool/reasoning parsers. Start with a Gemma-specific vLLM container, BF16 if memory allows,
data-parallel-size=1,max-model-len=4096or16384, and plain chat completions. Then add one feature at a time.
Short conclusion
Gemma 4 26B-A4B is not vaporware and not unsupported. There are official recipes, cloud/Kubernetes guides, Red Hat guidance, and community/GB10/Blackwell reports showing that it can run.
But it is also not a casual drop-in model.
The difficulty comes from the intersection of:
- MoE architecture
- 26B total weights despite 4B active compute
- long-context KV-cache pressure
- vLLM support freshness
- quantization format maturity
- CUDA / Blackwell / container compatibility
- Transformers v5-era dependency movement
- tool-use / reasoning parser maturity
- Kubernetes runtime details
So the safest summary is:
Gemma 4 26B-A4B is deployable, but not casually deployable. Use a known-good stack, start small, prove plain chat first, then add context, quantization, parallelism, and tool/reasoning features one at a time.
Discussion in the ATmosphere