{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreifgcwgbintur5ootvuaiwrmravqi7n3bmezeoesypne7ed5n5dbn4",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mlvwwj2vwnp2"
},
"path": "/t/need-english-only-or-minimal-multilingual-2b-4b-llm-for-agentic-ai-on-gtx-1660-super-6gb-vram-quantization-friendly/176044#post_1",
"publishedAt": "2026-05-15T17:46:07.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "I’m building an agentic AI application on very limited hardware: **GTX 1660 Super (Turing, 6GB VRAM)**. I plan to run a single LLM per agent (not multiple models simultaneously) to stay within VRAM limits.\n\n**What I’ve tried so far:**\n\n * `llama-3.2-3b-instruct` (4-bit) → poor results\n\n * `SmolLM3-3B` (no quantization) → good results, but it saturates the 6GB of VRAM and leaves nothing for computation\n\n * `SmolLM3-3B` (4-bit) → better than Llama, but still not good enough for my needs\n\n * Planning to test `Qwen3-4B-Thinking` and `Phi-3-mini-128k-instruct` next\n\n**My problem:** All of these models are multilingual, which is overkill for my use case. I suspect the extra language coverage wastes parameter capacity and VRAM that could otherwise improve English performance or reduce model size.\n\n**My request:** Can you recommend a **2B–4B parameter LLM that is English-only (or covers at most 2–3 languages)** and works well with 4-bit or 8-bit quantization on 6GB VRAM? I’m looking for something that prioritizes English instruction-following, reasoning, and agentic tasks (tool use, planning, memory) over multilingual coverage.\n\n**Bonus points if:**\n\n * The model is known to be quantization-friendly (GPTQ, AWQ, or llama.cpp compatible)\n\n * Quantized versions are already available on HF\n\n * It has good benchmark scores (MMLU, GSM8K) compared to SmolLM3 or Llama-3.2-3B\n\n**What I don’t need:**\n\n * Translation capabilities\n\n * Support for non-Latin scripts\n\n * A massive vocabulary covering rare Unicode characters\n\nThank you!",
"title": "Need English-only (or minimal multilingual) 2B-4B LLM for Agentic AI on GTX 1660 Super (6GB VRAM) – quantization friendly"
}