Using a Hugging Face Model offline to support code generation in VSCode
Plugins don’t seem to work very well with autocomplete features.
Also, regardless of the model size, when handling long context lengths with an LLM, if you don’t choose the right attention backend, performance can drop to absurdly slow levels… This is likely to be a problem for coding tasks. And if you’re using an older generation of GPUs, you may have fewer options for attention backends.
The highest-leverage improvements are on the prompt path , model-role split , and memory settings.
1. Make autocomplete the first success target
In Continue, rules are included in Agent, Chat, and Edit , but not in autocomplete or apply. Continue also currently recommends QwenCoder2.5 1.5B and QwenCoder2.5 7B as strong open autocomplete models. That makes autocomplete much lighter than chat on a small local machine. (Continue Docs)
That means your first target should be:
- fast inline completion
- small prompt window
- small output
- no extra repo context at first
Only after that is stable should you optimize chat/edit.
2. Split chat and autocomplete into separate model roles
Continue supports model roles such as chat, edit, and autocomplete, with separate settings for each. For your setup, that is the cleanest architecture. Use one model for chat/edit and a smaller, faster one for autocomplete. (Continue Docs)
A practical pattern is:
- chat/edit : your current instruct model
- autocomplete : a smaller coder model such as QwenCoder2.5 1.5B
That avoids forcing one local model to satisfy two very different latency targets. (Continue Docs)
3. Shrink the base system prompt hard
Continue’s config supports baseSystemMessage, and its rules system appends rules into the system message for Chat, Agent, and Edit. So if the local model is slow, the fastest gain usually comes from replacing the default long instruction block with something minimal and removing extra rules until the loop is stable. (Continue Docs)
A good first-pass chat system prompt is just:
baseSystemMessage: "You are a local coding assistant. Be brief. Prefer minimal diffs."
That is not magic. It just cuts prompt weight.
4. Cap prompt and output size aggressively
Continue exposes the exact settings you need:
defaultCompletionOptions.contextLengthdefaultCompletionOptions.maxTokensrequestOptions.timeoutautocompleteOptions.maxPromptTokensautocompleteOptions.modelTimeoutautocompleteOptions.onlyMyCodeautocompleteOptions.useImportsautocompleteOptions.useRecentlyEditedautocompleteOptions.useRecentlyOpened(Continue Docs)
For a first stable setup, I would start around:
- chat/edit
contextLength: 4096 - chat/edit
maxTokens: 128or256 - autocomplete
maxPromptTokens: 256to384 - autocomplete
maxTokens: 32to64 onlyMyCode: trueuseImports: falseuseRecentlyEdited: falseuseRecentlyOpened: false
Those numbers are my recommendation, not a Continue default. The reason is simple: prompt growth is usually what turns “slow but usable” into “appears broken.”
5. Increase timeouts before judging the setup
Continue supports request-level timeout controls and autocomplete timeout controls in config. If the model is working but slow, short client timeouts can make a valid local setup look dead. (Continue Docs)
A reasonable first pass is:
requestOptions:
timeout: 180000
and for autocomplete:
autocompleteOptions:
modelTimeout: 12000
That gives the local model more room while still keeping inline completion from hanging forever.
6. If you stay on Ollama, use the memory-saving switches
Ollama documents two settings that matter a lot for long context on small VRAM:
OLLAMA_FLASH_ATTENTION=1OLLAMA_KV_CACHE_TYPE=q8_0orq4_0(Ollama)
Ollama states that Flash Attention can significantly reduce memory usage as context grows, and that KV-cache quantization reduces memory further. It documents q8_0 as using about half the memory of f16 with small quality loss, while q4_0 uses about one quarter with more noticeable degradation. (Ollama)
So the safest first server setting is:
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
Then only move to q4_0 if memory is still too tight.
7. Keep context modest first
Ollama’s context-length docs say that on systems with under 24 GiB VRAM , the default context is 4K , and that larger context needs more memory. They also recommend using ollama ps to check whether the model remains fully on GPU or gets partially offloaded to CPU. (Ollama)
So on a 6 GB class machine, I would not begin with 32K or 64K. I would test in this order:
4096819216384
and stop increasing once ollama ps shows CPU offload or latency becomes unusable. That step-up sequence is my recommendation, based on Ollama’s documented VRAM guidance. (Ollama)
8. Use a cleaner Continue config
A minimal local config for your setup could look like this:
name: Local HF
version: 1.0.0
schema: v1
models:
- name: local-chat
provider: openai
apiBase: http://127.0.0.1:8000/v1
model: qwen2.5-coder-3b-instruct
roles: [chat, edit]
baseSystemMessage: "You are a local coding assistant. Be brief. Prefer minimal diffs."
defaultCompletionOptions:
contextLength: 4096
maxTokens: 256
temperature: 0.2
requestOptions:
timeout: 180000
- name: local-autocomplete
provider: openai
apiBase: http://127.0.0.1:8000/v1
model: qwen2.5-coder-1.5b-instruct
roles: [autocomplete]
autocompleteOptions:
debounceDelay: 400
maxPromptTokens: 384
modelTimeout: 12000
onlyMyCode: true
useImports: false
useRecentlyEdited: false
useRecentlyOpened: false
defaultCompletionOptions:
temperature: 0.1
maxTokens: 64
Every field used there is documented by Continue’s config reference. The exact values are tuned for a small local setup. (Continue Docs)
9. Keep Agent mode off until the basics are solid
Continue’s agent flow sends tools along with chat requests and can loop through tool calls and tool results. That makes the prompt path heavier and more complex than plain chat or autocomplete. On constrained local hardware, agent mode is a later step, not the first one. (Continue Docs)
10. My priority order
If I were tuning your setup, I would do it in this order:
- Get autocomplete working fast with a small model. (Continue Docs)
- Strip chat down to a tiny base system message and no extra rules. (Continue Docs)
- Keep context at 4K–8K first. (Ollama)
- Turn on Flash Attention and
q8_0KV cache if using Ollama. (Ollama) - Increase timeout before concluding the integration is failing. (Continue Docs)
- Only then add larger context, more rules, or agent features. (Continue Docs)
The main idea is simple: reduce the problem from “local IDE assistant” to “one small model, one small prompt, one fast feature.”
Discussion in the ATmosphere