
Need English-only (or minimal multilingual) 2B-4B LLM for Agentic AI on GTX 1660 Super (6GB VRAM) – quantization friendly

Hugging Face Forums [Unofficial] May 15, 2026

I’m building an agentic AI application on very limited hardware: a GTX 1660 Super (Turing, 6GB VRAM). To stay within VRAM limits, I plan to run a single LLM per agent, never multiple models at the same time.

What I’ve tried so far:

Llama-3.2-3B-Instruct (4-bit) → poor results
SmolLM3-3B (no quantization) → good results, but the weights alone nearly fill the 6GB of VRAM, leaving no headroom for the KV cache and activations
SmolLM3-3B (4-bit) → better than Llama-3.2-3B, but still not good enough for my needs
Planning to test Qwen3-4B-Thinking and Phi-3-mini-128k-instruct next
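
For reference, here is a rough back-of-the-envelope sketch of why the unquantized 3B model fills the card. It counts weight storage only and assumes a round 3 billion parameters (`N_PARAMS` is an illustrative figure, not an exact count for any of the models above). Real quantized files also store scales and other metadata, and the KV cache, activations, and CUDA context come on top, so actual usage will be somewhat higher.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 1024**3


# Assumption: a round 3e9 parameters for a "3B" model.
N_PARAMS = 3e9

for label, bits in [("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

At 16 bits per weight this comes to roughly 5.6 GiB, which matches the saturation I saw, while 4-bit weights need about 1.4 GiB and leave room for context.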

My problem: All these models are multilingual. That’s overkill for my use case. I suspect those extra language capabilities waste parameter capacity and VRAM that could otherwise improve English performance or reduce model size.
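
To put a rough number on that suspicion, the sketch below estimates how many parameters sit in the token embedding table. The vocabulary size, hidden size, and total parameter count are assumed values resembling a Llama-3.2-3B-style configuration; I have not verified them against the actual `config.json`, so treat them as placeholders and substitute the real values for whichever model is being checked.

```python
def embedding_params(vocab_size: int, hidden_size: int, tied: bool = True) -> int:
    """Parameters in the token embedding table(s).

    With tied embeddings the input table is reused as the output projection,
    so it is counted once; otherwise it is counted twice.
    """
    tables = 1 if tied else 2
    return vocab_size * hidden_size * tables


# Assumed, illustrative values (check the model's config.json):
vocab_size = 128_256
hidden_size = 3_072
total_params = 3.2e9

emb = embedding_params(vocab_size, hidden_size, tied=True)
print(f"embedding params: {emb / 1e6:.0f}M ({emb / total_params:.1%} of total)")
```

Under these assumptions the vocabulary accounts for roughly an eighth of the parameters. That is an upper bound on what a smaller English-only vocabulary could save, and it says nothing about how much capacity in the transformer layers goes to other languages.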

My request: Can you recommend a 2B–4B parameter LLM that is English-only (or covers at most 2–3 languages) and works well with 4-bit or 8-bit quantization on 6GB of VRAM? I’m looking for something that prioritizes English instruction-following, reasoning, and agentic tasks (tool use, planning, memory) over multilingual coverage.

Bonus points if:

The model is known to be quantization-friendly (GPTQ, AWQ, or llama.cpp compatible)
There are quantized versions available on HF already
It has good benchmark scores (MMLU, GSM8K) compared to SmolLM3 or Llama-3.2-3B

What I don’t need:

Translation capabilities
Support for non-Latin scripts
Massive vocabulary covering rare Unicode characters

Thank you!
