DeepSeek? Qwen?
Well. That ChatGPT conclusion may not be unreasonable. In simple terms, if you mean running that model on a single H200 GPU, the model is probably too large for the available VRAM. System RAM can be used as an escape hatch, but you should not expect it to be fast. So the model may run, but it may also be extremely slow.
Short answer
If the machine is literally 1×H200 GPU + 2TB system RAM, I would not start with DeepSeek V4 Flash as the first practical model.
I would treat it as an advanced experiment, not as the default recommendation.
The model itself may be good. The problem is fit. The official model card describes DeepSeek V4 Flash as a 284B total / 13B active MoE model with 1M context:
deepseek-ai/DeepSeek-V4-Flash
A single H200 is very strong, but it is still a single GPU with about 141GB HBM3e:
NVIDIA H200
So I would separate the cases like this:
| Hardware interpretation | Practical meaning |
|---|---|
| 1×H200 + 2TB system RAM | DeepSeek V4 Flash may be possible with quantization/offload, but I would expect it to be slow or backend-sensitive. |
| 8×H200 node + 2TB system RAM | DeepSeek V4 Flash becomes much more natural as a vLLM/SGLang-style serving target. |
| 1×H200 with GGUF/llama.cpp-style offload | Interesting for experiments, but speed and backend maturity become the main questions. |
The vLLM recipe for DeepSeek V4 Flash shows an H200 example around an 8-GPU H200 node with prefill/decode splitting:
vLLM recipe: DeepSeek V4 Flash
That does not mean one H200 is useless. It just means that “the model can run somewhere” and “this is a comfortable first model for one H200” are different statements.
“Runs” is not the same as “runs well”
For one H200, I would not only ask:
Can the model load?
I would ask:
Can the model give acceptable latency, throughput, context length, stability, and quality for the actual workload?
Those are different questions.
System RAM helps with capacity. It does not magically turn CPU RAM into HBM. If the runtime constantly has to move weights, experts, or cache data between CPU RAM and GPU memory, generation can become transfer-bound.
| 2TB system RAM helps with… | But it does not automatically solve… |
|---|---|
| Holding huge quantized weights in memory | GPU execution speed |
| CPU offload experiments | CPU-GPU transfer bottlenecks |
| Trying multiple models or quant levels | Low-latency serving |
| RAG, preprocessing, and evaluation datasets | Token generation speed |
| MoE expert offload experiments | Backend maturity issues |
A useful short version is:
System RAM helps capacity and experimentation much more than raw generation speed.
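To make the "transfer-bound" point concrete, here is a rough back-of-envelope sketch. It assumes decode is limited purely by how fast the active weights can be read, and it compares two bandwidth figures that are my assumptions, not measurements: about 4800 GB/s for H200 HBM3e and about 64 GB/s for PCIe Gen5 x16 in one direction. The 4.5 bits per weight is also an assumed effective size for a 4-bit quant.

```python
# Back-of-envelope decode ceiling: tokens/s <= bandwidth / bytes read per token.
# All bandwidth figures below are rough assumptions, not benchmarks.

GB = 1e9


def decode_ceiling_tok_s(active_params: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * GB / bytes_per_token


ACTIVE_PARAMS = 13e9    # DeepSeek V4 Flash active parameters (from the model card)
BITS_PER_WEIGHT = 4.5   # assumption: effective size of a 4-bit quant incl. scales

HBM_GB_S = 4800         # assumption: H200 HBM3e peak bandwidth
PCIE_GB_S = 64          # assumption: PCIe Gen5 x16 peak, one direction

on_gpu = decode_ceiling_tok_s(ACTIVE_PARAMS, BITS_PER_WEIGHT, HBM_GB_S)
over_pcie = decode_ceiling_tok_s(ACTIVE_PARAMS, BITS_PER_WEIGHT, PCIE_GB_S)

print(f"weights resident in HBM  : <= {on_gpu:,.0f} tok/s")
print(f"weights streamed via PCIe: <= {over_pcie:,.1f} tok/s")
print(f"ceiling ratio            : {on_gpu / over_pcie:.0f}x")
```

Real runtimes will not hit either ceiling, and offload backends often keep hot layers on the GPU or compute experts on the CPU instead of streaming everything. The point is only the order-of-magnitude gap between the two paths.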
Do not confuse MoE active parameters with dense model size
This is a common trap.
When a MoE model says 13B active, that does not mean it has the same memory requirement as a 13B dense model.
For MoE models:
- active parameters are closer to per-token compute cost
- total parameters are closer to model residency / storage / offload planning
- non-active experts still need to live somewhere
- expert placement and routing matter a lot
- backend support matters a lot
| Model | Total params | Active params | Practical warning |
|---|---|---|---|
| DeepSeek V4 Flash | 284B | 13B | Not a 13B memory problem. |
| Qwen3.5-122B-A10B | 122B | 10B | More practical, but still not a 10B memory problem. |
| Qwen3.6-35B-A3B | 35B | 3B | Much more natural as a first single-H200 MoE candidate. |
| MiniMax-M2 | 229.9B | 9.8B | Interesting, but still a large-MoE/offload/backend experiment. |
The rule I would use is:
Active parameters are a compute signal, not a complete VRAM estimate.
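A quick sketch of that rule, using the total/active numbers from the table above. The 4.5 bits per weight and the weights-only comparison against 141GB are my simplifying assumptions; KV cache and runtime overhead are deliberately ignored here.

```python
# Weights-only footprint: total params decide residency, active params decide
# how much is touched per token. 1 GB = 1e9 bytes.

H200_VRAM_GB = 141  # approximate HBM3e capacity of one H200


def weight_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate weight size in GB; 4.5 bpw is an assumed 4-bit quant size."""
    return params_billion * bits_per_weight / 8


# (total params in B, active params in B), copied from the table above
models = {
    "DeepSeek V4 Flash": (284, 13),
    "Qwen3.5-122B-A10B": (122, 10),
    "Qwen3.6-35B-A3B": (35, 3),
    "MiniMax-M2": (229.9, 9.8),
}

for name, (total_b, active_b) in models.items():
    resident = weight_gb(total_b)   # must live somewhere (VRAM or system RAM)
    touched = weight_gb(active_b)   # roughly what one token reads
    verdict = "fits" if resident < H200_VRAM_GB else "does NOT fit"
    print(f"{name:20s} resident ~{resident:6.1f} GB, "
          f"per-token ~{touched:4.1f} GB, weights-only: {verdict} in 141 GB")
```

Under these assumptions the 13B-active model still needs roughly 160GB of resident weights, which is why it is not a 13B memory problem.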
For memory planning, also check:
| Factor | Why it matters |
|---|---|
| Total parameters | Determines how much weight data must live somewhere. |
| Quantization format | Changes memory footprint, speed, and quality. |
| KV cache | Can dominate memory use at long context. |
| Context length | 8K, 64K, 128K, and 1M are very different deployment problems. |
| Batch / concurrency | Serving one user and serving many users are different. |
| Backend support | New models can have missing operators, special attention, or immature kernels. |
| Offload behavior | CPU RAM can save capacity, but transfer can kill speed. |
Use 4-bit estimates, but treat them as a lower bound
For current local OSS LLM use, I would usually size models assuming good 4-bit weight quantization first.
That is more realistic than assuming BF16/FP16 for every local deployment.
But a 4-bit sizing table is still only a first-pass estimate. It is not a guarantee.
Useful reference:
Hugging Face GGUF docs
A good warning is:
The table below assumes good 4-bit weight quantization and moderate context length. It does not fully include KV cache, batching, CUDA/workspace overhead, backend buffers, or long-context serving costs.
| Model scale | 1×H200 practicality, assuming good 4-bit weights | Comment |
|---|---|---|
| 7B–14B dense | Very easy | Fast, but probably too small if you want to exploit an H200. |
| 24B–40B dense/MoE | Excellent first target | Good quality/speed range; practical baseline. |
| 70B dense | Very realistic | Natural use of a large single GPU. |
| 100B–130B dense/MoE | Upper practical range | Worth testing; KV cache and context length matter. |
| 200B–300B total MoE | Advanced / experimental | Possible in some setups, but do not assume it will be fast. |
| 400B+ total MoE | Usually not a first single-H200 target | May run with heavy offload, but “usable” depends heavily on backend and tolerance for low tokens/sec. |
| 1T-class MoE | Watchlist / joke / special case | Interesting, but not where I would start on one H200. |
For this setup, I would probably test in this order:
| Order | Size range | Goal |
|---|---|---|
| 1 | 24B–40B | Fast baseline with modern models. |
| 2 | 70B | Strong large-single-GPU baseline. |
| 3 | 100B–130B | Upper practical range. |
| 4 | 200B+ MoE | Only after baseline latency/quality is known. |
Quantization is practical, but not magic
4-bit quantization is often the practical default for large local models. But it still trades off memory, speed, and quality.
The quality loss is often small enough to be acceptable for large models, especially with good formats. But it is not literally zero.
It can matter more for:
- math
- code
- strict JSON/tool calling
- long reasoning chains
- small models
- difficult instruction following
- tasks where small logit differences matter
Speed is also not automatic. Quantization can speed things up by reducing the number of bytes that must be read from memory per token and by allowing the model to fit on the GPU. But some formats require dequantization or special kernels, and performance depends on backend implementation.
| Quant level | Practical meaning |
|---|---|
| Q8 / FP8 / 8-bit | Quality-oriented if memory allows. |
| Q6 / Q5 | Good quality/capacity balance. |
| Q4 | Practical default for many large local models. |
| Q3 | Sometimes acceptable for large models; test quality. |
| Q2 / ~2-bit | Emergency or experiment zone. |
| IQ1 / ~1.5–1.8 bpw | Funny but real; not a normal first recommendation. |
| BitNet b1.58-style models | Separate low-bit-native architecture/training direction, not ordinary post-training quantization. |
As a small quantization joke: yes, 1-bit and 2-bit quants exist. If the alternative is “the model does not fit at all,” 1.5–2 bit can sometimes be useful. But I would not use those as the normal recommendation. I would size the machine around good 4-bit weights first.
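To see why the very low-bit options become tempting, it helps to turn the question around: given a VRAM budget, how many bits per weight can a model afford? This is a minimal sketch. The bits-per-weight values in `ASSUMED_BPW` are illustrative round numbers I picked (nominal bits plus a little overhead), not the exact sizes of any specific GGUF quant, and the 20GB reserve is a placeholder for KV cache and runtime buffers.

```python
def weights_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return total_params_b * bits_per_weight / 8


def max_bits_per_weight(total_params_b: float, vram_gb: float,
                        reserve_gb: float = 0.0) -> float:
    """Largest average bits/weight that still fits in vram_gb - reserve_gb."""
    return (vram_gb - reserve_gb) * 8 / total_params_b


# Assumption: illustrative effective sizes, real quant formats vary per model.
ASSUMED_BPW = {"8-bit": 8.5, "6-bit": 6.5, "5-bit": 5.5, "4-bit": 4.5,
               "3-bit": 3.5, "2-bit": 2.5, "~1.6 bpw": 1.6}

TOTAL_B = 284   # DeepSeek V4 Flash total parameters, in billions
VRAM_GB = 141   # one H200

for label, bpw in ASSUMED_BPW.items():
    size = weights_gb(TOTAL_B, bpw)
    note = "fits (weights only)" if size < VRAM_GB else "needs offload"
    print(f"{label:9s} ~{size:6.1f} GB  {note}")

print(f"max bpw, no reserve   : {max_bits_per_weight(TOTAL_B, VRAM_GB):.2f}")
print(f"max bpw, 20 GB reserve: {max_bits_per_weight(TOTAL_B, VRAM_GB, 20):.2f}")
```

With these numbers a 284B model only fits fully in VRAM somewhere below 4 bits per weight, which is exactly the zone where quality should be tested rather than assumed.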
Long context changes the memory math
Model weights are only one part of VRAM use.
Long context can make KV cache a major memory consumer.
A model that fits at 8K context may not be comfortable at 64K, 128K, or 1M context. This is especially important for models that advertise very long context.
For DeepSeek V4 Flash, “supports 1M context” and “I can serve 1M context comfortably on one H200” are very different statements.
vLLM has documentation on quantized KV cache:
vLLM Quantized KV Cache
That page is useful because it highlights the point: KV cache is important enough that people quantize it separately.
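A small sketch of how the KV cache grows with context. The formula is the usual one for standard multi-head or grouped-query attention. The layer count, KV head count, and head dimension below are a hypothetical configuration chosen for illustration; they are not DeepSeek V4 Flash's real architecture, and models with compressed or sparse attention will have a different (usually smaller) per-token cost.

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """KV cache size in GB for one sequence (1 GB = 1e9 bytes).

    The leading 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9


# Hypothetical GQA model config (assumption, not taken from any model card).
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

for ctx in (8_192, 65_536, 131_072, 1_048_576):
    fp16 = kv_cache_gb(ctx, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)
    fp8 = kv_cache_gb(ctx, LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)
    print(f"{ctx:>9,} tokens: ~{fp16:7.1f} GB at 16-bit, ~{fp8:7.1f} GB at 8-bit")
```

For this made-up config, one 8K sequence costs about 2GB, while a single 1M-token sequence would need more KV cache than the whole GPU has, even at 8-bit. Multiply by the number of concurrent sequences for serving.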
When comparing models, I would track:
| Metric | Why |
|---|---|
| VRAM used | Shows whether the model actually fits with your settings. |
| CPU RAM used | Shows how much offload/caching is happening. |
| Time to first token | Important for UX and serving latency. |
| Generation tok/s | Important for actual output speed. |
| Prompt tok/s | Important for long-context workloads. |
| Max context tested | Prevents misleading “it fits at 8K” conclusions. |
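Time to first token and generation tok/s can be measured the same way for every backend if you wrap the token stream. Below is a minimal sketch; `fake_stream` is a hypothetical stand-in for whatever streaming iterator your backend exposes, and `measure_stream` is a helper name I made up for this example.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and report TTFT and decode speed.

    Generation tok/s is measured from the first token to the last token,
    so prompt processing time is excluded from it.
    """
    start = clock()
    first_at = last_at = None
    count = 0
    for _ in tokens:
        last_at = clock()
        if first_at is None:
            first_at = last_at
        count += 1
    if count == 0:
        return {"tokens": 0, "ttft_s": None, "gen_tok_s": None}
    decode_time = last_at - first_at
    gen = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {"tokens": count, "ttft_s": first_at - start, "gen_tok_s": gen}


def fake_stream(n_tokens: int, prefill_s: float, per_token_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a backend's streaming generation API."""
    time.sleep(prefill_s)            # simulated prompt processing
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)  # simulated decode step
        yield f"tok{i}"


result = measure_stream(fake_stream(n_tokens=20, prefill_s=0.05, per_token_s=0.005))
print(result)
```

The `clock` parameter is only there so the function can be checked with a fake clock; in normal use the default is fine.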
Backend maturity matters, especially for new models
A model can have valid weights and still be annoying to run.
This happens often with very new models.
| Possible issue | What to check |
|---|---|
| New operators / attention patterns | vLLM, SGLang, Transformers, llama.cpp support |
| Multimodal processors | Whether the backend supports the exact processor path |
| Special chat template | Model card and tokenizer config |
| Special response format | Example: GPT-OSS Harmony format |
| GGUF still in progress | llama.cpp discussions / model repo notes |
| Missing repo files or metadata | HF Files and community discussions |
| Backend lag | Recent issues, PRs, and real user reports |
This is why older models can be attractive. They may be less exciting, but the runtime path is usually safer.
How I would search for OSS LLMs today
I would not choose a model by asking only “what is the best model?”
I would use leaderboards and community attention to build a shortlist, then reject candidates that do not fit the runtime.
Useful discovery links:
- Hugging Face Models
- Hugging Face Leaderboards docs
- Hugging Face Evaluation Results
- LiveBench
- LM Arena
- Artificial Analysis LLM Leaderboard
My search process would be:
| Step | Check |
|---|---|
| 1 | Find active model families from HF, leaderboards, and community discussion. |
| 2 | Open the exact model card, not just a leaderboard row. |
| 3 | Check total params, active params, context length, and license. |
| 4 | Check whether the repo has the files you actually need. |
| 5 | Check vLLM / SGLang / GGUF / llama.cpp support. |
| 6 | Check recent issues and discussions. |
| 7 | Run your own small benchmark. |
Leaderboards are useful, but they are not the final answer. A high-ranking model can still be a bad fit if it is painful to run on your hardware.
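The "shortlist, then reject" idea can be written down as a tiny filter. This is only a sketch: the candidate names are placeholders rather than real models, the backend sets are invented, and the 4.5 bits per weight and 20GB reserve are assumptions you should replace with your own numbers.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    name: str
    total_params_b: float                       # total parameters, in billions
    backends: set = field(default_factory=set)  # backends you verified yourself
    has_required_files: bool = True             # e.g. the quant files you need


def weight_gb(total_params_b: float, bits_per_weight: float = 4.5) -> float:
    """Approximate 4-bit weight size in GB (4.5 bpw is an assumption)."""
    return total_params_b * bits_per_weight / 8


def reject_reasons(c: Candidate, vram_gb: float, reserve_gb: float,
                   backend: str) -> list:
    """Return the reasons a candidate fails the runtime fit check, if any."""
    reasons = []
    if weight_gb(c.total_params_b) > vram_gb - reserve_gb:
        reasons.append("weights exceed VRAM budget")
    if backend not in c.backends:
        reasons.append(f"no {backend} support recorded")
    if not c.has_required_files:
        reasons.append("repo is missing required files")
    return reasons


# Placeholder shortlist; names and backend sets are hypothetical.
candidates = [
    Candidate("example-35b-moe", 35, {"vllm", "llama.cpp"}),
    Candidate("example-120b-moe", 120, {"vllm"}),
    Candidate("example-280b-moe", 280, {"vllm"}),
    Candidate("example-30b-new-arch", 30, set()),
]

for c in candidates:
    reasons = reject_reasons(c, vram_gb=141, reserve_gb=20, backend="vllm")
    status = "KEEP" if not reasons else "REJECT: " + "; ".join(reasons)
    print(f"{c.name:22s} {status}")
```

A rejected model is not a bad model. It just moves to the "later" pile until the backend or the hardware changes.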
Practical candidate families I would investigate on one H200
I would not present this as a definitive ranking. The open-model landscape changes too quickly, and backend support matters a lot.
But if I had 1×H200 + 2TB RAM, these are the kinds of model families I would personally investigate first.
First practical tests
| Candidate | Why I would look at it |
|---|---|
| Gemma 4 26B-A4B / 31B | Newer, strong, and still in a practical size range. Check backend support because newer architecture features can matter. |
| Qwen3.6-35B-A3B | Very attractive size for one H200: 35B total / 3B active, with vLLM/SGLang/KTransformers compatibility noted on the model card. |
| Qwen2.5-Coder-32B-Instruct | Older, safer coding baseline; likely easier to run than very new models. |
| Mistral Small 3.2 24B | Practical 24B-class baseline; good first comparison point. |
| DeepSeek-R1-Distill-Qwen-32B | Useful if reasoning is important and you want a 32B-class baseline. |
Strong larger tests
| Candidate | Why I would look at it |
|---|---|
| Qwen2.5-72B-Instruct | Older but strong and safe; good baseline for a large single GPU. |
| GPT-OSS-120B | Very interesting for one H200 because it is documented as fitting into a single 80GB-class GPU. Make sure to use the required Harmony format. |
| Qwen3.5-122B-A10B | Larger modern MoE candidate; still more realistic than 200B–300B+ total MoE as a first large experiment. |
| Mistral Medium 3.5 128B | Dense 128B with long-context ambitions; interesting upper-range test for one H200 with quantization. |
| Llama 70B-class baselines | Useful because Llama-compatible tooling is mature, especially for GGUF/llama.cpp-style workflows. |
Advanced / only after smaller baselines
| Candidate | Why I would be careful |
|---|---|
| DeepSeek V4 Flash | Interesting model, but 284B total params makes it an offload/backend experiment on one H200. |
| Qwen3-235B-A22B | Large MoE; worth testing only after you know your latency/quality baseline. |
| MiniMax-M2 | 229.9B total / 9.8B active; interesting agentic model, but still a large-MoE deployment experiment. |
| Llama 4 Scout | Potentially interesting, but check exact backend support and memory behavior. |
I am intentionally not listing every exciting new frontier MoE model here. For example, GLM-5-class models may be interesting, but they are too large to be good “first practical candidates” for a single H200. I would rather list models that I would realistically test first.
Half-joke / watchlist
| Item | Why it is not my first practical target |
|---|---|
| Kimi K2 / Kimi V2-class giant MoE models | Exciting, but I would not make a 1T-class MoE my first practical single-H200 target. |
| 1-bit / 2-bit quants | Real, funny, and sometimes useful, but I would treat them as emergency or experiment options. |
Useful local inference references
| Topic | Links |
|---|---|
| GGUF / local apps | HF GGUF docs, HF Local Apps, Ollama on HF |
| Quantization | llama.cpp quantization README, Qwen llama.cpp quantization guide |
| MoE offload | llama.cpp MoE offload guide, ik_llama.cpp hybrid CPU/GPU inference |
| Unsloth / GGUF export | Unsloth requirements, What model should I use?, Saving to GGUF, Connect llama.cpp to Unsloth |
Build a tiny internal eval set
Public leaderboards are for shortlisting. For deployment, I would also make a small private eval set from real internal tasks. Even 20–50 carefully chosen cases can be useful; promptfoo and LangSmith Evaluation are good references.
| Category | Example | Score |
|---|---|---|
| Summarization | memo / meeting note | factuality, omissions, action items |
| Extraction | emails / tickets / PDFs | exact match, JSON schema |
| RAG QA | internal docs | faithfulness, citations |
| Long context | largest realistic bundle | accuracy, latency, memory |
| Coding / JSON | script or API payload | tests, schema, business rules |
| Regression | previous failures | pass/fail + note |
Record the same basics for every model: backend, quant, context, VRAM, CPU RAM, tok/s, quality, failure mode.
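One way to keep those records consistent is a fixed record type plus a very simple scorer. The sketch below is an assumption about how you might organize it, not a standard format: `RunRecord`, `json_exact_match`, and `to_csv` are names I made up, and every value in the example record is a placeholder, not a real measurement.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class RunRecord:
    model: str
    backend: str
    quant: str
    context_tokens: int
    vram_gb: float
    cpu_ram_gb: float
    gen_tok_s: float
    quality: float          # fraction of eval cases passed, 0.0 to 1.0
    failure_mode: str = ""


def json_exact_match(output: str, expected: dict) -> bool:
    """Pass only if the output parses as JSON and equals the expected object."""
    try:
        return json.loads(output) == expected
    except json.JSONDecodeError:
        return False


def to_csv(records: list) -> str:
    """Serialize records to CSV text with one fixed column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(RunRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()


# Placeholder extraction cases: (model output, expected object).
cases = [
    ('{"ticket_id": 17, "priority": "high"}', {"ticket_id": 17, "priority": "high"}),
    ('Sure! Here is the JSON: {"ticket_id": 18}', {"ticket_id": 18}),
]
passed = sum(json_exact_match(out, exp) for out, exp in cases)

record = RunRecord(
    model="example-model", backend="example-backend", quant="4-bit",
    context_tokens=8192, vram_gb=0.0, cpu_ram_gb=0.0, gen_tok_s=0.0,
    quality=passed / len(cases), failure_mode="chatty preamble before JSON",
)
print(to_csv([record]))
```

Exact match is intentionally strict. It catches the "almost valid JSON" failures that matter for tool calling, which is one of the areas where quantization differences tend to show up.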
Bottom line
I would not say DeepSeek V4 Flash is a bad model.
I would say:
DeepSeek V4 Flash is probably too large to be the first practical target for one H200 if you care about speed and ease of deployment.
If this is 1×H200 + 2TB RAM, I would start with models around:
| First | Then | Later |
|---|---|---|
| Gemma 4 26B-A4B / 31B | Qwen2.5-72B | DeepSeek V4 Flash |
| Qwen3.6-35B-A3B | GPT-OSS-120B | Qwen3-235B-A22B |
| Qwen2.5-Coder-32B | Qwen3.5-122B-A10B | MiniMax-M2 |
| Mistral Small 3.2 24B | Mistral Medium 3.5 128B | other large MoE models |
The main lesson is:
Do not choose open LLMs by leaderboard rank or active parameter count alone. Choose them by matching model architecture, total size, quantization, context length, KV cache, backend support, and hardware reality.