Using a Hugging Face Model offline to support code generation in VSCode
It should work fine if you use the extension… apparently:
This can be made to work. The cleanest route for your exact requirements is:
Use Continue with a localconfig.yaml, and point it at either
- a real OpenAI-compatible
/v1endpoint, or - a fuller Ollama-compatible shim than just
/api/generate. (Continue Docs)
The short reason is simple:
VS Code extensions do not care that the weights came from Hugging Face. They care about the HTTP protocol they are talking to. Continue explicitly supports both an Ollama provider with custom apiBase and an OpenAI-compatible provider with custom apiBase. AI Toolkit also supports custom Ollama endpoints and custom OpenAI-compatible endpoints. (Continue Docs)
What is probably happening in your case
Your server works for the one flow you tested with curl. The extension is likely trying more than that.
That is not speculation in the abstract. Continue’s Ollama implementation calls multiple endpoints, including GET /api/tags, POST /api/show, POST /api/chat, and POST /api/generate. There are also real Continue issues from users who could reach their server manually but then saw Continue request /api/show or /api/chat and fail. (GitHub)
So this part matters:
/api/generate alone is usually not enough.
If you only spoofed /api/generate, and later added /api/tags, that still leaves a gap for tools that probe /api/show and /api/chat. That fits your symptoms very well. (Ollama Documentation)
The background that makes this confusing
Inside VS Code, “AI coding” is not one feature. It is usually several:
- chat
- edit/apply
- inline completion
- indexing/embeddings/model discovery
Different tools use different endpoints for those. Continue’s OpenAI-compatible provider docs even mention forcing legacy completions usage, which is a clue that not every feature goes through the same route. Continue also documents separate model roles and separate autocomplete setup. (Continue Docs)
That is why “my server answers a prompt and returns response” is necessary but not sufficient.
The strongest recommendation for your setup
Use Continue first
Continue is the best match to what you asked for because it has:
- a documented offline / air-gapped guide
- documented local config
- explicit support for Ollama
- explicit support for OpenAI-compatible providers via
apiBase(Continue Docs)
Those are the clearest official explanations I found for “use a local model in an IDE without cloud dependency.” (Continue Docs)
Do not start with VS Code built-in chat
VS Code’s own docs say that when you use bring-your-own models for chat, the Copilot service API is still used for some tasks such as embeddings, repository indexing, query refinement, intent detection, and side queries. There are also issue reports explicitly asking for local models to work without GitHub login and completely offline , which means your complaint is shared by other users and is not solved by default. (Visual Studio Code)
So for your requirement of no login, no tracking, no tokens, no telemetry , VS Code’s built-in path is the wrong first target. (Visual Studio Code)
Why “Hugging Face, not Ollama” is the wrong dividing line
This is the key conceptual point.
“Hugging Face” is where your model and tooling come from. “Ollama” or “OpenAI-compatible” is the wire protocol your editor is speaking.
A Hugging Face model can sit behind:
- your own FastAPI wrapper
- TGI
- vLLM
- another OpenAI-compatible server
- an Ollama-like shim
The editor only sees the API. It does not know or care whether the weights originally came from Hugging Face. Continue’s OpenAI docs explicitly describe connecting to OpenAI-compatible providers via apiBase. AI Toolkit explicitly supports adding custom models with an OpenAI-compatible endpoint, and also custom Ollama endpoints. (Continue Docs)
So no, VS Code and Continue are not “intentionally incompatible with Hugging Face.” The real compatibility boundary is protocol shape , not model origin. (Continue Docs)
The two viable designs
Design A. Keep your current Ollama-style shim
This is the quickest path if you want to reuse your work.
But then implement a more complete Ollama subset:
GET /api/tagsPOST /api/showPOST /api/chatPOST /api/generate
Those are all part of Ollama’s documented API surface, and they are the same paths Continue users have reported seeing in practice. (Ollama Documentation)
The official Ollama API docs list generate, chat, embeddings, list models, and show model details. That matches the shape tools tend to expect. (Ollama Documentation)
Design B. Switch to an OpenAI-compatible /v1 endpoint
This is the cleaner long-term design.
Continue documents using provider: openai with a custom apiBase. AI Toolkit also documents adding a self-hosted or local model with an OpenAI-compatible endpoint. (Continue Docs)
For editor tooling, this is often easier to reuse across tools than a custom fake-Ollama server.
My view: Design B is better long-term. Design A is faster if you are already close.
The trap with OpenAI-compatible mode
Do not assume POST /v1/chat/completions is enough.
Continue’s docs mention legacy completions handling, and real user reports show cases where chat worked differently from edit/autocomplete because different endpoints were used. That means a backend that only supports chat-style calls may still fail in coding workflows. (Continue Docs)
So if you go OpenAI-compatible, expect to support at least the endpoints your chosen extension actually uses, not just the one you wish it used. (Continue Docs)
The clearest explanation of how to do it
The clearest official docs I found, in order, are:
Continue: How to Run Continue Without Internet Best overall explanation for your privacy goal. It covers offline setup, local providers, and disabling telemetry. (Continue Docs)
Continue: How to Understand Hub vs Local Configuration Best explanation of why local
config.yamlis the right path for an offline or restricted setup. (Continue Docs)Continue: How to Configure OpenAI Models with Continue Best explanation if you want to expose your Hugging Face model through a custom
/v1server. (Continue Docs)Continue: How to Configure Ollama with Continue Best explanation if you want to keep your current “spoof Ollama” idea. (Continue Docs)
Ollama API introduction Best reference for which
/api/...endpoints an Ollama-style server normally exposes. (Ollama Documentation)AI Toolkit model docs Useful mainly to confirm that custom Ollama endpoints and OpenAI-compatible endpoints are officially supported concepts. (Visual Studio Code)
What I would do if I were solving your exact problem
I would do this in order.
Step 1. Stop testing multiple VS Code AI extensions at once
Pick Continue first. It has the clearest docs for offline local use, and you can fully control the config locally. (Continue Docs)
Step 2. Decide whether you want the fastest win or the cleanest architecture
If you want the fastest win, keep your current server and make it answer:
/api/tags/api/show/api/chat/api/generate(Ollama Documentation)
If you want the cleanest architecture, expose an OpenAI-compatible/v1 API and point Continue’s provider: openai at it. (Continue Docs)
Step 3. Use local Continue config
Continue documents local config as machine-local, offline-capable, and suitable for strict data policies. That matches your stated goal exactly. (Continue Docs)
A minimal shape looks like this:
name: Local Config
version: 1.0.0
schema: v1
models:
- name: Local HF via OpenAI API
provider: openai
model: qwen2.5-coder-3b
apiBase: http://127.0.0.1:8000/v1
That pattern follows Continue’s documented OpenAI-compatible configuration. (Continue Docs)
Or, if you keep the Ollama-style shim:
name: Local Config
version: 1.0.0
schema: v1
models:
- name: Local HF via Ollama Shim
provider: ollama
model: qwen2.5-coder:3b
apiBase: http://127.0.0.1:11434
That pattern follows Continue’s documented Ollama configuration. (Continue Docs)
Step 4. Disable everything nonessential for the first test
Do not try to solve chat, edit, autocomplete, indexing, and agents all at once.
First get one prompt-response loop working inside Continue chat. Then add edit. Then test inline completion. Continue’s docs and config model support this incremental approach. (Continue Docs)
About OpenClaw
OpenClaw is not intentionally incompatible with local Hugging Face models. Its current docs explicitly describe two local paths:
- native Ollama integration using
/api/chat - OpenAI-compatible local servers such as vLLM (OpenClaw)
So the answer is not “OpenClaw rejects Hugging Face.”
But I would still not use OpenClaw as your next step. Why:
- its docs are aimed at a broader agent stack, not the simplest VS Code coding-assistant setup
- there are recent issues around custom local providers,
baseUrl, and provider routing (GitHub)
So OpenClaw may become viable later, but it is a worse first target than Continue for your current goal. (OpenClaw)
My bottom-line judgment
For your case:
- Your privacy requirement is reasonable.
- Your Hugging Face local model choice is not the blocker.
- Your current fake-Ollama endpoint is probably too incomplete for the extension you are testing.
- Continue is the best first extension to target.
- VS Code built-in Copilot/BYOK is not a good fit for strict no-login offline use.
- OpenClaw is not intentionally incompatible, but it is the wrong next battle. (Continue Docs)
The cleanest practical path is:
Continue + local config + either
- a more complete Ollama-style shim, or
- a proper OpenAI-compatible
/v1server in front of your Hugging Face model. (Continue Docs)
Discussion in the ATmosphere