External Publication

Successfully Running Gemma4-26B On-Prem? Looking to Discuss Deployment Struggles & Stable Setups

Hugging Face Forums [Unofficial] May 24, 2026

There do seem to be some success stories. Since it’s quite complex, here are the details:

My read is that Gemma 4 26B-A4B is deployable, but not casually deployable.

There are now enough working examples to say “yes, people are getting it running.” But the pattern I see is also pretty clear: successful setups tend to pin the entire stack tightly instead of mixing arbitrary latest versions of CUDA, PyTorch, Transformers, vLLM, quantized weights, Docker images, and Kubernetes settings.

So I would not frame this as “Gemma 4 is broken.” I would frame it as:

Gemma 4 26B-A4B sits in a fast-moving compatibility zone: new Gemma4 model support, MoE serving, long-context KV-cache pressure, quantized weights, CUDA/Blackwell containers, vLLM support freshness, reasoning/tool parsers, and Transformers v5-era dependency movement.

That makes it very different from deploying an older dense text-only model.

Working examples / reference deployments

1. vLLM has an official Gemma 4 recipe

vLLM’s Gemma 4 guide explicitly covers google/gemma-4-26B-A4B-it.

Useful links:

Gemma 4 Usage Guide - vLLM Recipes
Google/gemma-4-26B-A4B-it - vLLM Recipes

The recipe describes Gemma 4 as a multimodal model family with support for structured thinking/reasoning, function calling with a custom tool-use protocol, and OpenAI-compatible serving through vLLM.

The model-specific recipe is especially useful because it names several deployment-relevant properties:

google/gemma-4-26B-A4B-it
MoE model
26B total parameters
about 4B active parameters
128 experts
top-8 routing
text + image support
thinking mode
tool-use protocol
vLLM 0.19.1+ expectation
memory guidance around --max-model-len, --gpu-memory-utilization, multimodal profiling, FP8 KV cache, and async scheduling

A minimal conceptual starting point looks like this:

vllm serve google/gemma-4-26B-A4B-it \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --dtype bfloat16 \
  --data-parallel-size 1

I would treat that as a baseline sanity check , not as a final production command.

The important part is not the exact command. The important part is the shape:

one model
one known-good vLLM path
short or moderate context first
BF16 if memory allows
no GGUF
no runtime MXFP4
no data parallelism at the beginning
no tool parser or reasoning parser until plain chat works

2. Google has a GKE + vLLM guide for Gemma 4

Google’s GKE guide includes a dedicated vLLM manifest path for Gemma 4 26B-A4B-it:

Serve Gemma open models using GPUs on GKE with vLLM

A very important detail: Google’s example limits the context window to 16K with:

--max-model-len=16384

The guide also says that if you want a larger context window, up to 128K, you need to adjust the manifest and node-pool configuration with more GPU capacity.

That is a useful signal. Even an official Kubernetes example does not start by maxing out the context window. It starts conservatively.

For on-prem Kubernetes, I would read that as:

do not start at 128K
do not assume “the model loads” means “the service can handle realistic KV cache”
leave memory headroom
treat max-model-len as an infrastructure decision, not just a model option

3. Red Hat has a Day-0 Gemma 4 + vLLM guide

Red Hat also has a practical guide:

Run Gemma 4 with Red Hat AI on Day 0: A step-by-step guide

The most useful line from that guide is the memory reality check:

The 26B A4B variant requires approximately 49 GB of GPU memory for model weights alone, so a single 80 GB GPU is recommended.

That explains a lot of confusion around “A4B.”

Yes, A4B means only a subset of parameters is active per token. But it does not mean you can treat the model like a normal 4B model from a serving-memory perspective. The full MoE weights still matter, and the serving engine still needs room for KV cache, CUDA graphs, batching, fragmentation overhead, and runtime buffers.

So:

“4B active” helps compute cost.
It does not make this a normal 4B deployment.
BF16 still wants serious memory.
80GB-class hardware is the conservative path.

4. There are on-prem-ish DGX Spark / GB10 examples

This Japanese write-up is useful because it is close to the kind of on-prem/Kubernetes pain people actually hit:

Gemma 4 26B-A4B-it を DGX Spark(GB10)で vLLM で動かす

The useful part is not only “it worked.” The useful part is the operational lesson:

memory limits need to be generous
max-model-len should not be too ambitious at first
single-GPU workloads and RollingUpdate can be a bad fit
first startup is heavy
local cache makes testing easier
base environment consistency still matters

That is exactly the kind of evidence I would show someone struggling with an on-prem deployment. It supports the idea that the model can be served, but the runtime contract matters a lot.

5. There are DGX Spark / Blackwell / NVFP4-oriented examples

There are also reports around DGX Spark, Blackwell GPUs, NVFP4, and Gemma-specific vLLM images:

DGX Spark で Gemma 4 26B-A4B / 31B Dense を vLLM で試す
Gemma 4 Models - which vLLM version? Any PRs spotted? - NVIDIA Developer Forums
Deploying Gemma 4 26B A4B on an RTX 5090
Finishing What We Started: Gemma 4 NVFP4 on vLLM, Desktop Blackwell, WSL2

These are encouraging, but they also reinforce the same point: exact combinations matter.

For example, some reports rely on:

vllm/vllm-openai:gemma4
vllm/vllm-openai:gemma4-cu130
nightly images
patched Gemma4 loader files
Blackwell-specific CUDA/container assumptions
NVFP4 or AWQ model repos
special MoE backend flags

So these examples are not “just pip install the latest vLLM and go.” They are better read as known-good or nearly-known-good stack recipes.

Why this is hard

The hard part is not only “large model serving.” It is several layers moving at once.

1. MoE is not the same as a small dense model

Gemma 4 26B-A4B is a MoE model. That changes the failure modes.

A dense model is conceptually simpler: all parameters are active on every forward pass. With MoE, each token is routed through a subset of experts, so you get lower active compute, but you also get:

expert routing
expert weight loading
MoE-specific kernels
MoE-specific quantization paths
different parallelism concerns
serving-engine support requirements

That is why “26B total / 4B active” is not equivalent to “just serve a 4B model.”

A useful mental model:

4B active helps compute.
26B total still affects weight memory and loader complexity.
MoE still affects kernels, routing, quantization, and parallelism.

2. “Server started” is not the same as “inference is stable”

There is a vLLM issue where Gemma 4 MoE 26B-A4B can load, capture CUDA graphs, and start the API server, but fail on the first inference request when using data parallelism:

Gemma 4 MoE (26B-A4B) crashes with --data-parallel-size > 1

That distinction matters a lot.

For this model, I would validate in stages:

model download works
tokenizer/config load works
weights load
CUDA graph capture completes
API server starts
/v1/models works
first /v1/chat/completions works
repeated requests work
concurrent requests work
longer context works
streaming works
tool calling works
reasoning parser works
structured output works

Those are not the same test.

3. Runtime MXFP4 has had Gemma 4 MoE-specific failures

There is a vLLM issue for runtime MXFP4 quantization:

Gemma 4 MoE (26B-A4B) — runtime MXFP4 quantization crashes during weight loading

The short version: --quantization mxfp4 crashes during weight loading for the Gemma 4 MoE model, even with --data-parallel-size 1.

This is a good example of why “just quantize it” is not enough. Quantization support has to exist at the exact intersection of:

model architecture
MoE loader
weight tensor layout
quantization format
kernel backend
vLLM version
CUDA / GPU generation
model repo format

So I would avoid runtime MXFP4 as a first attempt.

4. AWQ / NVFP4 support can be tag-dependent

There is also an issue suggesting AWQ behavior is tag-sensitive:

Gemma 4 26B-A4B AWQ INT4: latest v0.19.1 broken, latest gemma4 DockerHub tag works

This is exactly the kind of issue that makes “install latest stable” a weak strategy. Sometimes the model-specific image has the right patch; sometimes nightly has the patch but introduces a different breakage; sometimes the release tag is behind one model-specific PR.

So the safer approach is:

model repo + vLLM image + CUDA version + quant format + flags

as one unit.

Not:

latest model + latest vLLM + latest transformers + latest CUDA + random quant

5. GGUF is not a universal vLLM path

There are discussions about Gemma 4 GGUF not working cleanly with vLLM:

Unsloth Gemma 4 26B-A4B-it GGUF discussion: vLLM failure

I would keep the runtime lanes separate:

GGUF        -> llama.cpp / Ollama / LM Studio style runtimes
vLLM        -> HF safetensors or vLLM-supported quantized repos
Transformers -> current model docs, tokenizer, processor, model class support

A model being on Hugging Face does not mean every file format is equally natural for every runtime.

6. Tool calling / reasoning / structured output are separate validation layers

vLLM’s Gemma 4 guide mentions tool-use protocol and reasoning support, but that does not mean every tool parser / reasoning parser / structured output path is equally mature.

Relevant links:

Gemma 4 Usage Guide - vLLM Recipes
Gemma4 parser / reasoning / tool-call related issues in vLLM
Model seems to have issues in vLLM, characters duplication - Gemma 4 discussion

I would not test the model first with an agent framework, auto-tool-choice, streaming, structured output, and a quantized MoE backend all at once.

Plain chat first. Then one feature at a time.

7. Long context is a serving-memory problem, not just a model-card feature

vLLM exists partly because KV cache management is a major bottleneck in LLM serving.

Useful background:

Efficient Memory Management for Large Language Model Serving with PagedAttention

The PagedAttention paper explains why high-throughput serving struggles when KV cache memory is huge, grows and shrinks dynamically, and is wasted through fragmentation or duplication.

That matters here because --max-model-len directly changes KV-cache pressure. A model supporting long context does not mean your deployment can serve long context at your target concurrency.

This is why Google’s GKE example starting at 16K is meaningful.

Comparison with Qwen3.5 / Transformers v5 migration pain

I think Gemma 4 is in the same broad ecosystem migration zone as Qwen3.5, but the failure modes are somewhat different.

Qwen3.5 exposes the Transformers v5 boundary more directly.

Useful examples:

vLLM issue: Qwen3.5 Model Tokenizer Loading Failure
Unsloth issue: Qwen3.5 requires transformers>=5, but current vLLM dependency requires <5
Qwen3.5 HF discussion: Transformers does not recognize qwen3_5
Unsloth Qwen3.5 Fine-tuning Guide

The vLLM Qwen3.5 issue is especially clear: the model config expects Transformers >=5.0, while the vLLM image still requires Transformers >=4.56.0,<5.0.0. That creates a direct architecture/tokenizer recognition failure.

Gemma 4 seems different. It has more official serving paths already, including vLLM recipes and cloud/Kubernetes examples. But the pain shifts toward:

MoE serving
quantization loaders
model-specific vLLM tags
CUDA/Blackwell compatibility
KV cache sizing
tool/reasoning parser behavior
Kubernetes runtime contract
container vs native install differences

So I would summarize the contrast like this:

Qwen3.5:
  more obvious Transformers v5 architecture/tokenizer dependency pain.

Gemma 4:
  more official serving paths, but more practical pain around vLLM MoE serving,
  quantization, parser behavior, context length, and deployment runtime.

Why “latest everything” is risky

Transformers v5 and huggingface_hub v1.0 are also part of the background.

Useful links:

Transformers v5 Migration Guide
huggingface_hub v1.0 migration guide
Hugging Face Hub v1.0 announcement

The Transformers v5 migration guide notes that v5 pins huggingface_hub>=1.0.0. The Hub v1.0 migration brings its own changes, including a Python floor and HTTP stack migration to httpx.

So a “simple” model support upgrade can pull on:

Transformers version
huggingface_hub version
Python version
HTTP client behavior
downstream libraries pinned to Transformers 4.x
vLLM images with pinned dependency ranges
training/fine-tuning libraries
inference endpoint containers
notebook or hosted runtime images

This is why the issue often feels larger than the model itself.

My recommended debugging / deployment path

I would reduce the problem aggressively.

Phase 1: plain known-good serving

Start with:

- Gemma-specific vLLM image or official recipe
- BF16 if memory allows
- 1 GPU first
- data-parallel-size=1
- max-model-len=4096 or 16384
- no GGUF
- no runtime MXFP4
- no tool calling
- no reasoning parser
- no structured output
- no agent framework
- no Kubernetes rollout complexity if possible

Validate:

1. container starts
2. model downloads
3. weights load
4. API server starts
5. /v1/models works
6. minimal /v1/chat/completions works
7. repeated minimal requests work

Phase 2: increase context

Only after plain serving works:

- raise max-model-len to 16K
- test memory pressure
- test repeated requests
- test concurrency
- watch KV cache behavior

Do not jump straight to 128K.

Phase 3: add throughput features

Then test:

- prefix caching
- chunked prefill
- async scheduling
- FP8 KV cache
- max-num-seqs
- batching behavior

Phase 4: add quantization

Then try one quantization path at a time:

- known working AWQ repo + matching vLLM image
- known working NVFP4 repo + matching vLLM image
- avoid runtime MXFP4 until the exact issue status is clear

Keep a matrix:

model repo:
vLLM image:
CUDA version:
driver:
GPU:
quant:
flags:
max-model-len:
works / fails:
failure stage:

Phase 5: add tools / reasoning

Only after plain chat and concurrency are stable:

- --reasoning-parser gemma4
- --enable-auto-tool-choice
- --tool-call-parser gemma4
- streaming
- structured output
- agent framework integration

Validate these separately. A working plain chat endpoint does not prove a working tool-calling endpoint.

What I would tell someone struggling with this

I would say:

Your difficulties look normal for this model/runtime boundary. Gemma 4 26B-A4B has real working deployment paths, but the stable examples usually pin the stack tightly and start conservatively. I would not debug this as one bug. I would split it into architecture support, vLLM version, quantization path, CUDA/GPU compatibility, context length, Kubernetes runtime behavior, and parser/tooling behavior.

And:

For a first stable run, avoid native installs, avoid GGUF-on-vLLM, avoid runtime MXFP4, avoid data parallelism, avoid long context, and avoid tool/reasoning parsers. Start with a Gemma-specific vLLM container, BF16 if memory allows, data-parallel-size=1, max-model-len=4096 or 16384, and plain chat completions. Then add one feature at a time.

Short conclusion

Gemma 4 26B-A4B is not vaporware and not unsupported. There are official recipes, cloud/Kubernetes guides, Red Hat guidance, and community/GB10/Blackwell reports showing that it can run.

But it is also not a casual drop-in model.

The difficulty comes from the intersection of:

MoE architecture
26B total weights despite 4B active compute
long-context KV-cache pressure
vLLM support freshness
quantization format maturity
CUDA / Blackwell / container compatibility
Transformers v5-era dependency movement
tool-use / reasoning parser maturity
Kubernetes runtime details

So the safest summary is:

Gemma 4 26B-A4B is deployable, but not casually deployable. Use a known-good stack, start small, prove plain chat first, then add context, quantization, parallelism, and tool/reasoning features one at a time.