Need English-only (or minimal multilingual) 2B-4B LLM for Agentic AI on GTX 1660 Super (6GB VRAM) – quantization friendly
I’m building an Agentic AI application with very limited hardware: GTX 1660 Super (Turing, 6GB VRAM). I plan to run a single LLM per agent (not multiple models simultaneously) to stay within VRAM limits.
What I’ve tried so far:
llama-3.2-3b-instruct (4-bit) → poor results
SmolLM3-3B (no quantization) → good results but saturates 6GB VRAM, nothing left for computation
SmolLM3-3B (4-bit) → better than Llama, but still not good enough for my needs
Planning to test Qwen3-4B-Thinking and Phi-3-mini-128k-instruct next
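For context on the VRAM numbers above, here is a rough back-of-the-envelope sketch of weight memory at different precisions. It assumes a round 3 billion parameters and counts weights only; KV cache, activations, and framework overhead are ignored, so real usage will be higher. The function name is just illustrative.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory taken by model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 3e9  # assumption: round figure for a "3B" model

for label, bits in [("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

Under these assumptions the 16-bit weights alone come to roughly 5.6 GiB, which is consistent with the unquantized model saturating a 6GB card, while 4-bit weights land around 1.4 GiB and leave room for context.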
My problem: All these models are multilingual. That’s overkill for my use case. I suspect those extra language capabilities waste parameter capacity and VRAM that could otherwise improve English performance or reduce model size.
My request: Can you recommend a 2B–4B parameter LLM that is English-only (or supports at most 2–3 languages) and works well with 4-bit or 8-bit quantization on 6GB VRAM? I’m looking for something that prioritizes English instruction-following, reasoning, and agentic tasks (tool use, planning, memory) over multilingual coverage.
Bonus points if:
The model is known to be quantization-friendly (GPTQ, AWQ, or llama.cpp compatible)
There are quantized versions available on HF already
It has good benchmark scores (MMLU, GSM8K) compared to SmolLM3 or Llama-3.2-3B
What I don’t need:
Translation capabilities
Support for non-Latin scripts
Massive vocabulary covering rare Unicode characters
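To put a number on the vocabulary concern, here is a minimal sketch of how much of a small model's parameter budget the token embedding table can take. The vocabulary sizes, hidden size, and total parameter count are illustrative assumptions, not the config of any specific model, and the sketch assumes input and output embeddings are tied (a single shared table).

```python
def embedding_params(vocab_size: int, hidden_size: int, tied: bool = True) -> int:
    """Parameters in the token embedding table (doubled if the output head is untied)."""
    table = vocab_size * hidden_size
    return table if tied else 2 * table


TOTAL_PARAMS = 3e9   # assumption: round "3B" model
HIDDEN_SIZE = 3072   # assumption: illustrative hidden dimension

for vocab in (128_000, 32_000):  # assumed large multilingual vs. smaller vocabulary
    n = embedding_params(vocab, HIDDEN_SIZE)
    print(f"vocab={vocab:>7}: {n / 1e6:.0f}M params, {100 * n / TOTAL_PARAMS:.1f}% of total")
```

With these assumed numbers, a 128k vocabulary takes about 13% of the parameters versus about 3% for a 32k vocabulary, so vocabulary size is a non-trivial part of the budget at this scale.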
Thank you!