External Publication
Visit Post

Using a Hugging Face Model offline to support code generation in VSCode

Hugging Face Forums [Unofficial] March 28, 2026
Source

Plugins don’t seem to work very well with autocomplete features.

Also, regardless of the model size, when handling long context lengths with an LLM, if you don’t choose the right attention backend, performance can drop to absurdly slow levels… This is likely to be a problem for coding tasks. And if you’re using an older generation of GPUs, you may have fewer options for attention backends.


The highest-leverage improvements are on the prompt path , model-role split , and memory settings.

1. Make autocomplete the first success target

In Continue, rules are included in Agent, Chat, and Edit , but not in autocomplete or apply. Continue also currently recommends QwenCoder2.5 1.5B and QwenCoder2.5 7B as strong open autocomplete models. That makes autocomplete much lighter than chat on a small local machine. (Continue Docs)

That means your first target should be:

  • fast inline completion
  • small prompt window
  • small output
  • no extra repo context at first

Only after that is stable should you optimize chat/edit.

2. Split chat and autocomplete into separate model roles

Continue supports model roles such as chat, edit, and autocomplete, with separate settings for each. For your setup, that is the cleanest architecture. Use one model for chat/edit and a smaller, faster one for autocomplete. (Continue Docs)

A practical pattern is:

  • chat/edit : your current instruct model
  • autocomplete : a smaller coder model such as QwenCoder2.5 1.5B

That avoids forcing one local model to satisfy two very different latency targets. (Continue Docs)

3. Shrink the base system prompt hard

Continue’s config supports baseSystemMessage, and its rules system appends rules into the system message for Chat, Agent, and Edit. So if the local model is slow, the fastest gain usually comes from replacing the default long instruction block with something minimal and removing extra rules until the loop is stable. (Continue Docs)

A good first-pass chat system prompt is just:

baseSystemMessage: "You are a local coding assistant. Be brief. Prefer minimal diffs."

That is not magic. It just cuts prompt weight.

4. Cap prompt and output size aggressively

Continue exposes the exact settings you need:

  • defaultCompletionOptions.contextLength
  • defaultCompletionOptions.maxTokens
  • requestOptions.timeout
  • autocompleteOptions.maxPromptTokens
  • autocompleteOptions.modelTimeout
  • autocompleteOptions.onlyMyCode
  • autocompleteOptions.useImports
  • autocompleteOptions.useRecentlyEdited
  • autocompleteOptions.useRecentlyOpened (Continue Docs)

For a first stable setup, I would start around:

  • chat/edit contextLength: 4096
  • chat/edit maxTokens: 128 or 256
  • autocomplete maxPromptTokens: 256 to 384
  • autocomplete maxTokens: 32 to 64
  • onlyMyCode: true
  • useImports: false
  • useRecentlyEdited: false
  • useRecentlyOpened: false

Those numbers are my recommendation, not a Continue default. The reason is simple: prompt growth is usually what turns “slow but usable” into “appears broken.”

5. Increase timeouts before judging the setup

Continue supports request-level timeout controls and autocomplete timeout controls in config. If the model is working but slow, short client timeouts can make a valid local setup look dead. (Continue Docs)

A reasonable first pass is:

requestOptions:
  timeout: 180000

and for autocomplete:

autocompleteOptions:
  modelTimeout: 12000

That gives the local model more room while still keeping inline completion from hanging forever.

6. If you stay on Ollama, use the memory-saving switches

Ollama documents two settings that matter a lot for long context on small VRAM:

  • OLLAMA_FLASH_ATTENTION=1
  • OLLAMA_KV_CACHE_TYPE=q8_0 or q4_0 (Ollama)

Ollama states that Flash Attention can significantly reduce memory usage as context grows, and that KV-cache quantization reduces memory further. It documents q8_0 as using about half the memory of f16 with small quality loss, while q4_0 uses about one quarter with more noticeable degradation. (Ollama)

So the safest first server setting is:

OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

Then only move to q4_0 if memory is still too tight.

7. Keep context modest first

Ollama’s context-length docs say that on systems with under 24 GiB VRAM , the default context is 4K , and that larger context needs more memory. They also recommend using ollama ps to check whether the model remains fully on GPU or gets partially offloaded to CPU. (Ollama)

So on a 6 GB class machine, I would not begin with 32K or 64K. I would test in this order:

  • 4096
  • 8192
  • 16384

and stop increasing once ollama ps shows CPU offload or latency becomes unusable. That step-up sequence is my recommendation, based on Ollama’s documented VRAM guidance. (Ollama)

8. Use a cleaner Continue config

A minimal local config for your setup could look like this:

name: Local HF
version: 1.0.0
schema: v1

models:
  - name: local-chat
    provider: openai
    apiBase: http://127.0.0.1:8000/v1
    model: qwen2.5-coder-3b-instruct
    roles: [chat, edit]
    baseSystemMessage: "You are a local coding assistant. Be brief. Prefer minimal diffs."
    defaultCompletionOptions:
      contextLength: 4096
      maxTokens: 256
      temperature: 0.2
    requestOptions:
      timeout: 180000

  - name: local-autocomplete
    provider: openai
    apiBase: http://127.0.0.1:8000/v1
    model: qwen2.5-coder-1.5b-instruct
    roles: [autocomplete]
    autocompleteOptions:
      debounceDelay: 400
      maxPromptTokens: 384
      modelTimeout: 12000
      onlyMyCode: true
      useImports: false
      useRecentlyEdited: false
      useRecentlyOpened: false
    defaultCompletionOptions:
      temperature: 0.1
      maxTokens: 64

Every field used there is documented by Continue’s config reference. The exact values are tuned for a small local setup. (Continue Docs)

9. Keep Agent mode off until the basics are solid

Continue’s agent flow sends tools along with chat requests and can loop through tool calls and tool results. That makes the prompt path heavier and more complex than plain chat or autocomplete. On constrained local hardware, agent mode is a later step, not the first one. (Continue Docs)

10. My priority order

If I were tuning your setup, I would do it in this order:

  1. Get autocomplete working fast with a small model. (Continue Docs)
  2. Strip chat down to a tiny base system message and no extra rules. (Continue Docs)
  3. Keep context at 4K–8K first. (Ollama)
  4. Turn on Flash Attention and q8_0 KV cache if using Ollama. (Ollama)
  5. Increase timeout before concluding the integration is failing. (Continue Docs)
  6. Only then add larger context, more rules, or agent features. (Continue Docs)

The main idea is simple: reduce the problem from “local IDE assistant” to “one small model, one small prompt, one fast feature.”

Discussion in the ATmosphere

Loading comments...