{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreidj5wpjyvvxi5bdykyia6muh2gpxjxtdcbb2p6yhb7wmehs637nu4",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnvt3hdtffg2"
},
"path": "/t/deepseek-qwen/176657#post_2",
"publishedAt": "2026-06-10T03:31:16.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "I’'m not an expert but I’m trying to answer.\n\nPersonally, DeepSeek won’t work well. For vLLM to be fast, the model weights and context (KV cache) must fit entirely in your GPU’s VRAM. Your H200 has 141GB of VRAM (don’t confuse this with your 2TB system RAM). DeepSeek v4 Flash is simply too massive. Even heavily quantized, the weights alone will eat up all 141GB, causing Out-Of-Memory (OOM) errors. Offloading to your 2TB system RAM will ruin your generation speeds.\n\nYou want models in the 70B-72B range. Since the H200 natively supports FP8, you can run these models at 8-bit precision. They will only take about ~70GB of VRAM, leaving you 70GB+ for a massive context window and insanely fast speeds.\n\nBest options for internal text apps:\n\n 1. Llama-3.1-70B-Instruct (or newer) - The gold standard for general text reasoning.\n\n 2. Qwen-2.5-72B-Instruct - Amazing if your tasks involve coding, data, or multilingual text.\n\n\n",
"title": "Deepseek? Qwen?"
}