Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreih7w4pxybvaclrgyhbi5zi6sjmoyvttlen23kawpql4wdx5v7k73e",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnwau5npd6o2"
  },
  "path": "/t/deepseek-qwen/176657#post_3",
  "publishedAt": "2026-06-10T06:04:46.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "deepseek-ai/DeepSeek-V4-Flash",
    "NVIDIA H200",
    "vLLM recipe: DeepSeek V4 Flash",
    "DeepSeek V4 Flash",
    "Qwen3.5-122B-A10B",
    "Qwen3.6-35B-A3B",
    "MiniMax-M2",
    "Hugging Face GGUF docs",
    "vLLM Quantized KV Cache",
    "Hugging Face Models",
    "Hugging Face Leaderboards docs",
    "Hugging Face Evaluation Results",
    "LiveBench",
    "LM Arena",
    "Artificial Analysis LLM Leaderboard",
    "Gemma 4 26B-A4B / 31B",
    "Qwen2.5-Coder-32B-Instruct",
    "Mistral Small 3.2 24B",
    "DeepSeek-R1-Distill-Qwen-32B",
    "Qwen2.5-72B-Instruct",
    "GPT-OSS-120B",
    "Mistral Medium 3.5 128B",
    "Qwen3-235B-A22B",
    "Llama 4 Scout",
    "HF GGUF docs",
    "HF Local Apps",
    "Ollama on HF",
    "llama.cpp quantization README",
    "Qwen llama.cpp quantization guide",
    "llama.cpp MoE offload guide",
    "ik_llama.cpp hybrid CPU/GPU inference",
    "Unsloth requirements",
    "What model should I use?",
    "Saving to GGUF",
    "Connect llama.cpp to Unsloth",
    "promptfoo",
    "LangSmith Evaluation"
  ],
  "textContent": "Well. That ChatGPT conclusion may not be unreasonable. In simple terms, if you mean running that model on a single H200 GPU, the model is probably **too large for the available VRAM**. System RAM can be used as an escape hatch, but you should not expect it to be fast. So the model may run, but it may also be **extremely slow**.\n\n* * *\n\n## Short answer\n\nIf the machine is literally **1×H200 GPU + 2TB system RAM**, I would not start with **DeepSeek V4 Flash** as the first practical model.\n\nI would treat it as an **advanced experiment**, not as the default recommendation.\n\nThe model itself may be good. The problem is fit. The official model card describes DeepSeek V4 Flash as a **284B total / 13B active MoE model** with **1M context**:\n\ndeepseek-ai/DeepSeek-V4-Flash\n\nA single H200 is very strong, but it is still a single GPU with about **141GB HBM3e**:\n\nNVIDIA H200\n\nSo I would separate the cases like this:\n\nHardware interpretation | Practical meaning\n---|---\n**1×H200 + 2TB system RAM** | DeepSeek V4 Flash may be possible with quantization/offload, but I would expect it to be slow or backend-sensitive.\n**8×H200 node + 2TB system RAM** | DeepSeek V4 Flash becomes much more natural as a vLLM/SGLang-style serving target.\n**1×H200 with GGUF/llama.cpp-style offload** | Interesting for experiments, but speed and backend maturity become the main questions.\n\nThe vLLM recipe for DeepSeek V4 Flash shows an H200 example around an **8-GPU H200 node** with prefill/decode splitting:\n\nvLLM recipe: DeepSeek V4 Flash\n\nThat does not mean one H200 is useless. 
It just means that **“the model can run somewhere”** and **“this is a comfortable first model for one H200”** are different statements.\n\n## “Runs” is not the same as “runs well”\n\nFor one H200, I would not only ask:\n\n> Can the model load?\n\nI would ask:\n\n> Can the model give acceptable latency, throughput, context length, stability, and quality for the actual workload?\n\nThose are different questions.\n\nSystem RAM helps with **capacity**. It does not magically turn CPU RAM into HBM. If the runtime constantly has to move weights, experts, or cache data between CPU RAM and GPU memory, generation can become transfer-bound: limited by the CPU-GPU link rather than by GPU compute.\n\n2TB system RAM helps with… | But it does not automatically solve…\n---|---\nHolding huge quantized weights in memory | GPU execution speed\nCPU offload experiments | CPU-GPU transfer bottlenecks\nTrying multiple models or quant levels | Low-latency serving\nRAG, preprocessing, and evaluation datasets | Token generation speed\nMoE expert offload experiments | Backend maturity issues\n\nA useful short version is:\n\n> System RAM helps capacity and experimentation much more than raw generation speed.\n\n## Do not confuse MoE active parameters with dense model size\n\nThis is a common trap.\n\nWhen a MoE model says **13B active**, that does **not** mean it has the same memory requirement as a 13B dense model.\n\nFor MoE models:\n\n  * **active parameters** are closer to per-token compute cost\n  * **total parameters** are closer to model residency / storage / offload planning\n  * non-active experts still need to live somewhere\n  * expert placement and routing matter a lot\n  * backend support matters a lot\n\n\n\nModel | Total params | Active params | Practical warning\n---|---|---|---\nDeepSeek V4 Flash | 284B | 13B | Not a 13B memory problem.\nQwen3.5-122B-A10B | 122B | 10B | More practical, but still not a 10B memory problem.\nQwen3.6-35B-A3B | 35B | 3B | Much more natural as a first single-H200 MoE candidate.\nMiniMax-M2 | 229.9B | 
9.8B | Interesting, but still a large-MoE/offload/backend experiment.\n\nThe rule I would use is:\n\n> Active parameters are a compute signal, not a complete VRAM estimate.\n\nFor memory planning, also check:\n\nFactor | Why it matters\n---|---\n**Total parameters** | Determines how much weight data must live somewhere.\n**Quantization format** | Changes memory footprint, speed, and quality.\n**KV cache** | Can dominate memory use at long context.\n**Context length** | 8K, 64K, 128K, and 1M are very different deployment problems.\n**Batch / concurrency** | Serving one user and serving many users are different.\n**Backend support** | New models can have missing operators, special attention, or immature kernels.\n**Offload behavior** | CPU RAM can save capacity, but transfer can kill speed.\n\n## Use 4-bit estimates, but treat them as a lower bound\n\nFor current local OSS LLM use, I would usually size models assuming **good 4-bit weight quantization** first.\n\nThat is more realistic than assuming BF16/FP16 for every local deployment.\n\nBut a 4-bit sizing table is still only a first-pass estimate. It is not a guarantee.\n\nUseful reference:\n\nHugging Face GGUF docs\n\nA good warning is:\n\n> The table below assumes good 4-bit weight quantization and moderate context length. 
It does not fully include KV cache, batching, CUDA/workspace overhead, backend buffers, or long-context serving costs.\n\nModel scale | 1×H200 practicality, assuming good 4-bit weights | Comment\n---|---|---\n**7B–14B dense** | Very easy | Fast, but probably too small if you want to exploit an H200.\n**24B–40B dense/MoE** | Excellent first target | Good quality/speed range; practical baseline.\n**70B dense** | Very realistic | Natural use of a large single GPU.\n**100B–130B dense/MoE** | Upper practical range | Worth testing; KV cache and context length matter.\n**200B–300B total MoE** | Advanced / experimental | Possible in some setups, but do not assume it will be fast.\n**400B+ total MoE** | Usually not a first single-H200 target | May run with heavy offload, but “usable” depends heavily on backend and tolerance for low tokens/sec.\n**1T-class MoE** | Watchlist / joke / special case | Interesting, but not where I would start on one H200.\n\nFor this setup, I would probably test in this order:\n\nOrder | Size range | Goal\n---|---|---\n1 | **24B–40B** | Fast baseline with modern models.\n2 | **70B** | Strong large-single-GPU baseline.\n3 | **100B–130B** | Upper practical range.\n4 | **200B+ MoE** | Only after baseline latency/quality is known.\n\n## Quantization is practical, but not magic\n\n4-bit quantization is often the practical default for large local models. But it still trades off memory, speed, and quality.\n\nThe quality loss is often small enough to be acceptable for large models, especially with good formats. But it is not literally zero.\n\nIt can matter more for:\n\n  * math\n  * code\n  * strict JSON/tool calling\n  * long reasoning chains\n  * small models\n  * difficult instruction following\n  * tasks where small logit differences matter\n\n\n\nSpeed is also not automatic. Quantization can speed things up by reducing memory bandwidth and allowing the model to fit on GPU. 
But some formats require dequantization or special kernels, and performance depends on backend implementation.\n\nQuant level | Practical meaning\n---|---\n**Q8 / FP8 / 8-bit** | Quality-oriented if memory allows.\n**Q6 / Q5** | Good quality/capacity balance.\n**Q4** | Practical default for many large local models.\n**Q3** | Sometimes acceptable for large models; test quality.\n**Q2 / ~2-bit** | Emergency or experiment zone.\n**IQ1 / ~1.5–1.8 bpw** | Funny but real; not a normal first recommendation.\n**BitNet b1.58-style models** | Separate low-bit-native architecture/training direction, not ordinary post-training quantization.\n\nAs a small quantization joke: yes, 1-bit and 2-bit quants exist. If the alternative is “the model does not fit at all,” 1.5–2 bit can sometimes be useful. But I would not use those as the normal recommendation. I would size the machine around good 4-bit weights first.\n\n## Long context changes the memory math\n\nModel weights are only one part of VRAM use.\n\nLong context can make **KV cache** a major memory consumer.\n\nA model that fits at 8K context may not be comfortable at 64K, 128K, or 1M context. 
This is especially important for models that advertise very long context.\n\nFor DeepSeek V4 Flash, **“supports 1M context”** and **“I can serve 1M context comfortably on one H200”** are very different statements.\n\nvLLM has documentation on quantized KV cache:\n\nvLLM Quantized KV Cache\n\nThat page is useful because it highlights the point: KV cache is important enough that people quantize it separately.\n\nWhen comparing models, I would track:\n\nMetric | Why\n---|---\n**VRAM used** | Shows whether the model actually fits with your settings.\n**CPU RAM used** | Shows how much offload/caching is happening.\n**Time to first token** | Important for UX and serving latency.\n**Generation tok/s** | Important for actual output speed.\n**Prompt tok/s** | Important for long-context workloads.\n**Max context tested** | Prevents misleading “it fits at 8K” conclusions.\n\n## Backend maturity matters, especially for new models\n\nA model can have valid weights and still be annoying to run.\n\nThis happens often with very new models.\n\nPossible issue | What to check\n---|---\nNew operators / attention patterns | vLLM, SGLang, Transformers, llama.cpp support\nMultimodal processors | Whether the backend supports the exact processor path\nSpecial chat template | Model card and tokenizer config\nSpecial response format | Example: GPT-OSS Harmony format\nGGUF still in progress | llama.cpp discussions / model repo notes\nMissing repo files or metadata | HF Files and community discussions\nBackend lag | Recent issues, PRs, and real user reports\n\nThis is why older models can be attractive. 
They may be less exciting, but the runtime path is usually safer.\n\n## How I would search for OSS LLMs today\n\nI would not choose a model by asking only “what is the best model?”\n\nI would use leaderboards and community attention to build a shortlist, then reject candidates that do not fit the runtime.\n\nUseful discovery links:\n\n  * Hugging Face Models\n  * Hugging Face Leaderboards docs\n  * Hugging Face Evaluation Results\n  * LiveBench\n  * LM Arena\n  * Artificial Analysis LLM Leaderboard\n\n\n\nMy search process would be:\n\nStep | Check\n---|---\n1 | Find active model families from HF, leaderboards, and community discussion.\n2 | Open the exact model card, not just a leaderboard row.\n3 | Check total params, active params, context length, and license.\n4 | Check whether the repo has the files you actually need.\n5 | Check vLLM / SGLang / GGUF / llama.cpp support.\n6 | Check recent issues and discussions.\n7 | Run your own small benchmark.\n\nLeaderboards are useful, but they are not the final answer. A high-ranking model can still be a bad fit if it is painful to run on your hardware.\n\n## Practical candidate families I would investigate on one H200\n\nI would not present this as a definitive ranking. The open-model landscape changes too quickly, and backend support matters a lot.\n\nBut if I had **1×H200 + 2TB RAM**, these are the kinds of model families I would personally investigate first.\n\n### First practical tests\n\nCandidate | Why I would look at it\n---|---\nGemma 4 26B-A4B / 31B | Newer, strong, and still in a practical size range. 
Check backend support because newer architecture features can matter.\nQwen3.6-35B-A3B | Very attractive size for one H200: 35B total / 3B active, with vLLM/SGLang/KTransformers compatibility noted on the model card.\nQwen2.5-Coder-32B-Instruct | Older, safer coding baseline; likely easier to run than very new models.\nMistral Small 3.2 24B | Practical 24B-class baseline; good first comparison point.\nDeepSeek-R1-Distill-Qwen-32B | Useful if reasoning is important and you want a 32B-class baseline.\n\n### Strong larger tests\n\nCandidate | Why I would look at it\n---|---\nQwen2.5-72B-Instruct | Older but strong and safe; good baseline for a large single GPU.\nGPT-OSS-120B | Very interesting for one H200 because it is documented as fitting into a single 80GB-class GPU. Make sure to use the required Harmony format.\nQwen3.5-122B-A10B | Larger modern MoE candidate; still more realistic than 200B–300B+ total MoE as a first large experiment.\nMistral Medium 3.5 128B | Dense 128B with long-context ambitions; interesting upper-range test for one H200 with quantization.\nLlama 70B-class baselines | Useful because Llama-compatible tooling is mature, especially for GGUF/llama.cpp-style workflows.\n\n### Advanced / only after smaller baselines\n\nCandidate | Why I would be careful\n---|---\nDeepSeek V4 Flash | Interesting model, but 284B total params makes it an offload/backend experiment on one H200.\nQwen3-235B-A22B | Large MoE; worth testing only after you know your latency/quality baseline.\nMiniMax-M2 | 229.9B total / 9.8B active; interesting agentic model, but still a large-MoE deployment experiment.\nLlama 4 Scout | Potentially interesting, but check exact backend support and memory behavior.\n\nI am intentionally not listing every exciting new frontier MoE model here. For example, GLM-5-class models may be interesting, but they are too large to be good “first practical candidates” for a single H200. 
I would rather list models that I would realistically test first.\n\n### Half-joke / watchlist\n\nItem | Why it is not my first practical target\n---|---\nKimi K2 / Kimi V2-class giant MoE models | Exciting, but I would not make a 1T-class MoE my first practical single-H200 target.\n1-bit / 2-bit quants | Real, funny, and sometimes useful, but I would treat them as emergency or experiment options.\n\n## Useful local inference references\n\nTopic | Links\n---|---\nGGUF / local apps | HF GGUF docs, HF Local Apps, Ollama on HF\nQuantization | llama.cpp quantization README, Qwen llama.cpp quantization guide\nMoE offload | llama.cpp MoE offload guide, ik_llama.cpp hybrid CPU/GPU inference\nUnsloth / GGUF export | Unsloth requirements, What model should I use?, Saving to GGUF, Connect llama.cpp to Unsloth\n\n## Build a tiny internal eval set\n\nPublic leaderboards are for shortlisting. For deployment, I would also make a small private eval set from real internal tasks. Even **20–50 carefully chosen cases** can be useful; promptfoo and LangSmith Evaluation are good references.\n\nCategory | Example | Score\n---|---|---\nSummarization | memo / meeting note | factuality, omissions, action items\nExtraction | emails / tickets / PDFs | exact match, JSON schema\nRAG QA | internal docs | faithfulness, citations\nLong context | largest realistic bundle | accuracy, latency, memory\nCoding / JSON | script or API payload | tests, schema, business rules\nRegression | previous failures | pass/fail + note\n\nRecord the same basics for every model: **backend, quant, context, VRAM, CPU RAM, tok/s, quality, failure mode**.\n\n## Bottom line\n\nI would not say DeepSeek V4 Flash is a bad model.\n\nI would say:\n\n> DeepSeek V4 Flash is probably too large to be the first practical target for one H200 if you care about speed and ease of deployment.\n\nIf this is **1×H200 + 2TB RAM**, I would start with models around:\n\nFirst | Then | Later\n---|---|---\nGemma 4 26B-A4B / 31B | Qwen2.5-72B 
| DeepSeek V4 Flash\nQwen3.6-35B-A3B | GPT-OSS-120B | Qwen3-235B-A22B\nQwen2.5-Coder-32B | Qwen3.5-122B-A10B | MiniMax-M2\nMistral Small 3.2 24B | Mistral Medium 3.5 128B | other large MoE models\n\nThe main lesson is:\n\n> Do not choose open LLMs by leaderboard rank or active parameter count alone. Choose them by matching model architecture, total size, quantization, context length, KV cache, backend support, and hardware reality.",
  "title": "Deepseek? Qwen?"
}