Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreih6fdvvryoccu7rykxssbf4u37gtyyxgzjc4bb6iu2giq6jfyobk4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mi45g4vt2qs2"
  },
  "path": "/t/using-a-hugging-face-model-offline-to-support-code-generation-in-vscode/174627#post_6",
  "publishedAt": "2026-03-28T07:39:46.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Continue Docs",
    "Ollama"
  ],
  "textContent": "Plugins don’t seem to work very well with autocomplete features.\n\nAlso, regardless of the model size, when handling long context lengths with an LLM, if you don’t choose the right attention backend, performance can drop to absurdly slow levels… This is likely to be a problem for coding tasks.\nAnd if you’re using an older generation of GPUs, you may have fewer options for attention backends.\n\n* * *\n\nThe highest-leverage improvements are on the **prompt path** , **model-role split** , and **memory settings**.\n\n## 1. Make autocomplete the first success target\n\nIn Continue, **rules are included in Agent, Chat, and Edit** , but **not in autocomplete or apply**. Continue also currently recommends **QwenCoder2.5 1.5B** and **QwenCoder2.5 7B** as strong open autocomplete models. That makes autocomplete much lighter than chat on a small local machine. (Continue Docs)\n\nThat means your first target should be:\n\n  * fast inline completion\n  * small prompt window\n  * small output\n  * no extra repo context at first\n\n\n\nOnly after that is stable should you optimize chat/edit.\n\n## 2. Split chat and autocomplete into separate model roles\n\nContinue supports model roles such as `chat`, `edit`, and `autocomplete`, with separate settings for each. For your setup, that is the cleanest architecture. Use one model for chat/edit and a smaller, faster one for autocomplete. (Continue Docs)\n\nA practical pattern is:\n\n  * **chat/edit** : your current instruct model\n  * **autocomplete** : a smaller coder model such as QwenCoder2.5 1.5B\n\n\n\nThat avoids forcing one local model to satisfy two very different latency targets. (Continue Docs)\n\n## 3. Shrink the base system prompt hard\n\nContinue’s config supports `baseSystemMessage`, and its rules system appends rules into the system message for Chat, Agent, and Edit. 
So if the local model is slow, the fastest gain usually comes from replacing the default long instruction block with something minimal and removing extra rules until the loop is stable. (Continue Docs)\n\nA good first-pass chat system prompt is just:\n\n\n    baseSystemMessage: \"You are a local coding assistant. Be brief. Prefer minimal diffs.\"\n\n\nThat is not magic. It just cuts prompt weight.\n\n## 4. Cap prompt and output size aggressively\n\nContinue exposes the exact settings you need:\n\n  * `defaultCompletionOptions.contextLength`\n  * `defaultCompletionOptions.maxTokens`\n  * `requestOptions.timeout`\n  * `autocompleteOptions.maxPromptTokens`\n  * `autocompleteOptions.modelTimeout`\n  * `autocompleteOptions.onlyMyCode`\n  * `autocompleteOptions.useImports`\n  * `autocompleteOptions.useRecentlyEdited`\n  * `autocompleteOptions.useRecentlyOpened` (Continue Docs)\n\n\n\nFor a first stable setup, I would start around:\n\n  * chat/edit `contextLength: 4096`\n  * chat/edit `maxTokens: 128` or `256`\n  * autocomplete `maxPromptTokens: 256` to `384`\n  * autocomplete `maxTokens: 32` to `64`\n  * `onlyMyCode: true`\n  * `useImports: false`\n  * `useRecentlyEdited: false`\n  * `useRecentlyOpened: false`\n\n\n\nThose numbers are my recommendation, not a Continue default. The reason is simple: prompt growth is usually what turns “slow but usable” into “appears broken.”\n\n## 5. Increase timeouts before judging the setup\n\nContinue supports request-level timeout controls and autocomplete timeout controls in config. If the model is working but slow, short client timeouts can make a valid local setup look dead. (Continue Docs)\n\nA reasonable first pass is:\n\n\n    requestOptions:\n      timeout: 180000\n\n\nand for autocomplete:\n\n\n    autocompleteOptions:\n      modelTimeout: 12000\n\n\nThat gives the local model more room while still keeping inline completion from hanging forever.\n\n## 6. 
If you stay on Ollama, use the memory-saving switches\n\nOllama documents two settings that matter a lot for long context on small VRAM:\n\n  * `OLLAMA_FLASH_ATTENTION=1`\n  * `OLLAMA_KV_CACHE_TYPE=q8_0` or `q4_0` (Ollama)\n\n\n\nOllama states that Flash Attention can significantly reduce memory usage as context grows, and that KV-cache quantization reduces memory further. It documents `q8_0` as using about half the memory of `f16` with small quality loss, while `q4_0` uses about one quarter with more noticeable degradation. (Ollama)\n\nSo the safest first server setting is:\n\n\n    OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve\n\n\nThen only move to `q4_0` if memory is still too tight.\n\n## 7. Keep context modest first\n\nOllama’s context-length docs say that on systems with **under 24 GiB VRAM**, the default context is **4K**, and that larger context needs more memory. They also recommend using `ollama ps` to check whether the model remains fully on GPU or gets partially offloaded to CPU. (Ollama)\n\nSo on a 6 GB class machine, I would not begin with 32K or 64K. I would test in this order:\n\n  * `4096`\n  * `8192`\n  * `16384`\n\n\n\nand stop increasing once `ollama ps` shows CPU offload or latency becomes unusable. That step-up sequence is my recommendation, based on Ollama’s documented VRAM guidance. (Ollama)\n\n## 8. Use a cleaner Continue config\n\nA minimal local config for your setup could look like this:\n\n\n    name: Local HF\n    version: 1.0.0\n    schema: v1\n\n    models:\n      - name: local-chat\n        provider: openai\n        apiBase: http://127.0.0.1:8000/v1\n        model: qwen2.5-coder-3b-instruct\n        roles: [chat, edit]\n        baseSystemMessage: \"You are a local coding assistant. Be brief. 
Prefer minimal diffs.\"\n        defaultCompletionOptions:\n          contextLength: 4096\n          maxTokens: 256\n          temperature: 0.2\n        requestOptions:\n          timeout: 180000\n\n      - name: local-autocomplete\n        provider: openai\n        apiBase: http://127.0.0.1:8000/v1\n        model: qwen2.5-coder-1.5b-instruct\n        roles: [autocomplete]\n        autocompleteOptions:\n          debounceDelay: 400\n          maxPromptTokens: 384\n          modelTimeout: 12000\n          onlyMyCode: true\n          useImports: false\n          useRecentlyEdited: false\n          useRecentlyOpened: false\n        defaultCompletionOptions:\n          temperature: 0.1\n          maxTokens: 64\n\n\nEvery field used there is documented by Continue’s config reference. The exact values are tuned for a small local setup. (Continue Docs)\n\n## 9. Keep Agent mode off until the basics are solid\n\nContinue’s agent flow sends tools along with chat requests and can loop through tool calls and tool results. That makes the prompt path heavier and more complex than plain chat or autocomplete. On constrained local hardware, agent mode is a later step, not the first one. (Continue Docs)\n\n## 10. My priority order\n\nIf I were tuning your setup, I would do it in this order:\n\n  1. Get **autocomplete** working fast with a small model. (Continue Docs)\n  2. Strip chat down to a **tiny base system message** and no extra rules. (Continue Docs)\n  3. Keep context at **4K–8K first**. (Ollama)\n  4. Turn on **Flash Attention** and **`q8_0` KV cache** if using Ollama. (Ollama)\n  5. Increase timeout before concluding the integration is failing. (Continue Docs)\n  6. Only then add larger context, more rules, or agent features. (Continue Docs)\n\n\n\nThe main idea is simple: reduce the problem from “local IDE assistant” to “one small model, one small prompt, one fast feature.”",
  "title": "Using a Hugging Face Model offline to support code generation in VSCode"
}