Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreihacqv7hsbzfh5rrv6sidm7rb6gwkh7q6ydask3gpzhyd7a6roo3y",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mhwi4zb3jnt2"
  },
  "path": "/t/using-a-hugging-face-model-offline-to-support-code-generation-in-vscode/174627#post_2",
  "publishedAt": "2026-03-25T23:48:40.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Continue Docs",
    "GitHub",
    "Ollama Documentation",
    "Visual Studio Code",
    "OpenClaw"
  ],
  "textContent": "It should work fine if you use the extension… apparently:\n\n* * *\n\nThis can be made to work. The cleanest route for your exact requirements is:\n\n**Use Continue with a local`config.yaml`, and point it at either**\n\n  1. **a real OpenAI-compatible`/v1` endpoint**, or\n  2. **a fuller Ollama-compatible shim than just`/api/generate`**. (Continue Docs)\n\n\n\nThe short reason is simple:\n\n**VS Code extensions do not care that the weights came from Hugging Face. They care about the HTTP protocol they are talking to.** Continue explicitly supports both an Ollama provider with custom `apiBase` and an OpenAI-compatible provider with custom `apiBase`. AI Toolkit also supports custom Ollama endpoints and custom OpenAI-compatible endpoints. (Continue Docs)\n\n## What is probably happening in your case\n\nYour server works for the one flow you tested with `curl`. The extension is likely trying more than that.\n\nThat is not speculation in the abstract. Continue’s Ollama implementation calls multiple endpoints, including `GET /api/tags`, `POST /api/show`, `POST /api/chat`, and `POST /api/generate`. There are also real Continue issues from users who could reach their server manually but then saw Continue request `/api/show` or `/api/chat` and fail. (GitHub)\n\nSo this part matters:\n\n**`/api/generate` alone is usually not enough.**\n\nIf you only spoofed `/api/generate`, and later added `/api/tags`, that still leaves a gap for tools that probe `/api/show` and `/api/chat`. That fits your symptoms very well. (Ollama Documentation)\n\n## The background that makes this confusing\n\nInside VS Code, “AI coding” is not one feature. It is usually several:\n\n  * chat\n  * edit/apply\n  * inline completion\n  * indexing/embeddings/model discovery\n\n\n\nDifferent tools use different endpoints for those. Continue’s OpenAI-compatible provider docs even mention forcing legacy `completions` usage, which is a clue that not every feature goes through the same route. 
Continue also documents separate model roles and separate autocomplete setup. (Continue Docs)\n\nThat is why “my server answers a prompt and returns `response`” is necessary but not sufficient.\n\n## The strongest recommendation for your setup\n\n### Use Continue first\n\nContinue is the best match to what you asked for because it has:\n\n  * a documented **offline / air-gapped** guide\n  * documented **local config**\n  * explicit support for **Ollama**\n  * explicit support for **OpenAI-compatible providers** via `apiBase` (Continue Docs)\n\n\n\nThose are the clearest official explanations I found for “use a local model in an IDE without cloud dependency.” (Continue Docs)\n\n### Do not start with VS Code built-in chat\n\nVS Code’s own docs say that when you use bring-your-own models for chat, the Copilot service API is still used for some tasks such as embeddings, repository indexing, query refinement, intent detection, and side queries. There are also issue reports explicitly asking for local models to work **without GitHub login** and **completely offline**, which means your complaint is shared by other users and is not solved by default. (Visual Studio Code)\n\nSo for your requirement of **no login, no tracking, no tokens, no telemetry**, VS Code’s built-in path is the wrong first target. (Visual Studio Code)\n\n## Why “Hugging Face, not Ollama” is the wrong dividing line\n\nThis is the key conceptual point.\n\n“Hugging Face” is where your model and tooling come from. “Ollama” or “OpenAI-compatible” is the wire protocol your editor is speaking.\n\nA Hugging Face model can sit behind:\n\n  * your own FastAPI wrapper\n  * TGI\n  * vLLM\n  * another OpenAI-compatible server\n  * an Ollama-like shim\n\n\n\nThe editor only sees the API. It does not know or care whether the weights originally came from Hugging Face. Continue’s OpenAI docs explicitly describe connecting to OpenAI-compatible providers via `apiBase`. 
AI Toolkit explicitly supports adding custom models with an OpenAI-compatible endpoint, and also custom Ollama endpoints. (Continue Docs)\n\nSo no, VS Code and Continue are not “intentionally incompatible with Hugging Face.” The real compatibility boundary is **protocol shape**, not model origin. (Continue Docs)\n\n## The two viable designs\n\n### Design A. Keep your current Ollama-style shim\n\nThis is the quickest path if you want to reuse your work.\n\nBut then implement a more complete Ollama subset:\n\n  * `GET /api/tags`\n  * `POST /api/show`\n  * `POST /api/chat`\n  * `POST /api/generate`\n\n\n\nThose are all part of Ollama’s documented API surface, and they are the same paths Continue users have reported seeing in practice. (Ollama Documentation)\n\nThe official Ollama API docs list generate, chat, embeddings, list models, and show model details. That matches the shape tools tend to expect. (Ollama Documentation)\n\n### Design B. Switch to an OpenAI-compatible `/v1` endpoint\n\nThis is the cleaner long-term design.\n\nContinue documents using `provider: openai` with a custom `apiBase`. AI Toolkit also documents adding a self-hosted or local model with an OpenAI-compatible endpoint. (Continue Docs)\n\nFor editor tooling, this is often easier to reuse across tools than a custom fake-Ollama server.\n\nMy view: **Design B is better long-term. Design A is faster if you are already close.**\n\n## The trap with OpenAI-compatible mode\n\nDo not assume `POST /v1/chat/completions` is enough.\n\nContinue’s docs mention legacy completions handling, and real user reports show cases where chat worked differently from edit/autocomplete because different endpoints were used. That means a backend that only supports chat-style calls may still fail in coding workflows. (Continue Docs)\n\nSo if you go OpenAI-compatible, expect to support at least the endpoints your chosen extension actually uses, not just the one you wish it used. 
(Continue Docs)\n\n## The clearest explanation of how to do it\n\nThe clearest official docs I found, in order, are:\n\n  1. **Continue: How to Run Continue Without Internet**\nBest overall explanation for your privacy goal. It covers offline setup, local providers, and disabling telemetry. (Continue Docs)\n\n  2. **Continue: How to Understand Hub vs Local Configuration**\nBest explanation of why local `config.yaml` is the right path for an offline or restricted setup. (Continue Docs)\n\n  3. **Continue: How to Configure OpenAI Models with Continue**\nBest explanation if you want to expose your Hugging Face model through a custom `/v1` server. (Continue Docs)\n\n  4. **Continue: How to Configure Ollama with Continue**\nBest explanation if you want to keep your current “spoof Ollama” idea. (Continue Docs)\n\n  5. **Ollama API introduction**\nBest reference for which `/api/...` endpoints an Ollama-style server normally exposes. (Ollama Documentation)\n\n  6. **AI Toolkit model docs**\nUseful mainly to confirm that custom Ollama endpoints and OpenAI-compatible endpoints are officially supported concepts. (Visual Studio Code)\n\n\n\n\n## What I would do if I were solving your exact problem\n\nI would do this in order.\n\n### Step 1. Stop testing multiple VS Code AI extensions at once\n\nPick **Continue** first. It has the clearest docs for offline local use, and you can fully control the config locally. (Continue Docs)\n\n### Step 2. Decide whether you want the fastest win or the cleanest architecture\n\nIf you want the fastest win, keep your current server and make it answer:\n\n  * `/api/tags`\n  * `/api/show`\n  * `/api/chat`\n  * `/api/generate` (Ollama Documentation)\n\n\n\nIf you want the cleanest architecture, expose an **OpenAI-compatible `/v1`** API and point Continue’s `provider: openai` at it. (Continue Docs)\n\n### Step 3. Use local Continue config\n\nContinue documents local config as machine-local, offline-capable, and suitable for strict data policies. 
That matches your stated goal exactly. (Continue Docs)\n\nA minimal shape looks like this:\n\n\n    name: Local Config\n    version: 1.0.0\n    schema: v1\n\n    models:\n      - name: Local HF via OpenAI API\n        provider: openai\n        model: qwen2.5-coder-3b\n        apiBase: http://127.0.0.1:8000/v1\n\n\nThat pattern follows Continue’s documented OpenAI-compatible configuration. (Continue Docs)\n\nOr, if you keep the Ollama-style shim:\n\n\n    name: Local Config\n    version: 1.0.0\n    schema: v1\n\n    models:\n      - name: Local HF via Ollama Shim\n        provider: ollama\n        model: qwen2.5-coder:3b\n        apiBase: http://127.0.0.1:11434\n\n\nThat pattern follows Continue’s documented Ollama configuration. (Continue Docs)\n\n### Step 4. Disable everything nonessential for the first test\n\nDo not try to solve chat, edit, autocomplete, indexing, and agents all at once.\n\nFirst get one prompt-response loop working inside Continue chat. Then add edit. Then test inline completion. Continue’s docs and config model support this incremental approach. (Continue Docs)\n\n## About OpenClaw\n\nOpenClaw is **not intentionally incompatible with local Hugging Face models**. Its current docs explicitly describe two local paths:\n\n  * native Ollama integration using `/api/chat`\n  * OpenAI-compatible local servers such as vLLM (OpenClaw)\n\n\n\nSo the answer is not “OpenClaw rejects Hugging Face.”\n\nBut I would still not use OpenClaw as your next step. Why:\n\n  * its docs are aimed at a broader agent stack, not the simplest VS Code coding-assistant setup\n  * there are recent issues around custom local providers, `baseUrl`, and provider routing (GitHub)\n\n\n\nSo OpenClaw may become viable later, but it is a worse first target than Continue for your current goal. 
(OpenClaw)\n\n## My bottom-line judgment\n\nFor your case:\n\n  * **Your privacy requirement is reasonable.**\n  * **Your Hugging Face local model choice is not the blocker.**\n  * **Your current fake-Ollama endpoint is probably too incomplete for the extension you are testing.**\n  * **Continue is the best first extension to target.**\n  * **VS Code built-in Copilot/BYOK is not a good fit for strict no-login offline use.**\n  * **OpenClaw is not intentionally incompatible, but it is the wrong next battle.** (Continue Docs)\n\n\n\nThe cleanest practical path is:\n\n**Continue + local config + either**\n\n  * a more complete Ollama-style shim, or\n  * a proper OpenAI-compatible `/v1` server in front of your Hugging Face model. (Continue Docs)\n\n",
  "title": "Using a Hugging Face Model offline to support code generation in VSCode"
}