External Publication

Visit Post

Using a Hugging Face Model offline to support code generation in VSCode

Hugging Face Forums [Unofficial] March 25, 2026

Source

It should work fine if you use the extension… apparently:

This can be made to work. The cleanest route for your exact requirements is:

Use Continue with a localconfig.yaml, and point it at either

a real OpenAI-compatible/v1 endpoint, or
a fuller Ollama-compatible shim than just/api/generate. (Continue Docs)

The short reason is simple:

VS Code extensions do not care that the weights came from Hugging Face. They care about the HTTP protocol they are talking to. Continue explicitly supports both an Ollama provider with custom apiBase and an OpenAI-compatible provider with custom apiBase. AI Toolkit also supports custom Ollama endpoints and custom OpenAI-compatible endpoints. (Continue Docs)

What is probably happening in your case

Your server works for the one flow you tested with curl. The extension is likely trying more than that.

That is not speculation in the abstract. Continue’s Ollama implementation calls multiple endpoints, including GET /api/tags, POST /api/show, POST /api/chat, and POST /api/generate. There are also real Continue issues from users who could reach their server manually but then saw Continue request /api/show or /api/chat and fail. (GitHub)

So this part matters:

/api/generate alone is usually not enough.

If you only spoofed /api/generate, and later added /api/tags, that still leaves a gap for tools that probe /api/show and /api/chat. That fits your symptoms very well. (Ollama Documentation)

The background that makes this confusing

Inside VS Code, “AI coding” is not one feature. It is usually several:

chat
edit/apply
inline completion
indexing/embeddings/model discovery

Different tools use different endpoints for those. Continue’s OpenAI-compatible provider docs even mention forcing legacy completions usage, which is a clue that not every feature goes through the same route. Continue also documents separate model roles and separate autocomplete setup. (Continue Docs)

That is why “my server answers a prompt and returns response” is necessary but not sufficient.

The strongest recommendation for your setup

Use Continue first

Continue is the best match to what you asked for because it has:

a documented offline / air-gapped guide
documented local config
explicit support for Ollama
explicit support for OpenAI-compatible providers via apiBase (Continue Docs)

Those are the clearest official explanations I found for “use a local model in an IDE without cloud dependency.” (Continue Docs)

Do not start with VS Code built-in chat

VS Code’s own docs say that when you use bring-your-own models for chat, the Copilot service API is still used for some tasks such as embeddings, repository indexing, query refinement, intent detection, and side queries. There are also issue reports explicitly asking for local models to work without GitHub login and completely offline , which means your complaint is shared by other users and is not solved by default. (Visual Studio Code)

So for your requirement of no login, no tracking, no tokens, no telemetry , VS Code’s built-in path is the wrong first target. (Visual Studio Code)

Why “Hugging Face, not Ollama” is the wrong dividing line

This is the key conceptual point.

“Hugging Face” is where your model and tooling come from. “Ollama” or “OpenAI-compatible” is the wire protocol your editor is speaking.

A Hugging Face model can sit behind:

your own FastAPI wrapper
TGI
vLLM
another OpenAI-compatible server
an Ollama-like shim

The editor only sees the API. It does not know or care whether the weights originally came from Hugging Face. Continue’s OpenAI docs explicitly describe connecting to OpenAI-compatible providers via apiBase. AI Toolkit explicitly supports adding custom models with an OpenAI-compatible endpoint, and also custom Ollama endpoints. (Continue Docs)

So no, VS Code and Continue are not “intentionally incompatible with Hugging Face.” The real compatibility boundary is protocol shape , not model origin. (Continue Docs)

The two viable designs

Design A. Keep your current Ollama-style shim

This is the quickest path if you want to reuse your work.

But then implement a more complete Ollama subset:

GET /api/tags
POST /api/show
POST /api/chat
POST /api/generate

Those are all part of Ollama’s documented API surface, and they are the same paths Continue users have reported seeing in practice. (Ollama Documentation)

The official Ollama API docs list generate, chat, embeddings, list models, and show model details. That matches the shape tools tend to expect. (Ollama Documentation)

Design B. Switch to an OpenAI-compatible `/v1` endpoint

This is the cleaner long-term design.

Continue documents using provider: openai with a custom apiBase. AI Toolkit also documents adding a self-hosted or local model with an OpenAI-compatible endpoint. (Continue Docs)

For editor tooling, this is often easier to reuse across tools than a custom fake-Ollama server.

My view: Design B is better long-term. Design A is faster if you are already close.

The trap with OpenAI-compatible mode

Do not assume POST /v1/chat/completions is enough.

Continue’s docs mention legacy completions handling, and real user reports show cases where chat worked differently from edit/autocomplete because different endpoints were used. That means a backend that only supports chat-style calls may still fail in coding workflows. (Continue Docs)

So if you go OpenAI-compatible, expect to support at least the endpoints your chosen extension actually uses, not just the one you wish it used. (Continue Docs)

The clearest explanation of how to do it

The clearest official docs I found, in order, are:

Continue: How to Run Continue Without Internet Best overall explanation for your privacy goal. It covers offline setup, local providers, and disabling telemetry. (Continue Docs)
Continue: How to Understand Hub vs Local Configuration Best explanation of why local config.yaml is the right path for an offline or restricted setup. (Continue Docs)
Continue: How to Configure OpenAI Models with Continue Best explanation if you want to expose your Hugging Face model through a custom /v1 server. (Continue Docs)
Continue: How to Configure Ollama with Continue Best explanation if you want to keep your current “spoof Ollama” idea. (Continue Docs)
Ollama API introduction Best reference for which /api/... endpoints an Ollama-style server normally exposes. (Ollama Documentation)
AI Toolkit model docs Useful mainly to confirm that custom Ollama endpoints and OpenAI-compatible endpoints are officially supported concepts. (Visual Studio Code)

What I would do if I were solving your exact problem

I would do this in order.

Step 1. Stop testing multiple VS Code AI extensions at once

Pick Continue first. It has the clearest docs for offline local use, and you can fully control the config locally. (Continue Docs)

Step 2. Decide whether you want the fastest win or the cleanest architecture

If you want the fastest win, keep your current server and make it answer:

/api/tags
/api/show
/api/chat
/api/generate (Ollama Documentation)

If you want the cleanest architecture, expose an OpenAI-compatible/v1 API and point Continue’s provider: openai at it. (Continue Docs)

Step 3. Use local Continue config

Continue documents local config as machine-local, offline-capable, and suitable for strict data policies. That matches your stated goal exactly. (Continue Docs)

A minimal shape looks like this:

name: Local Config
version: 1.0.0
schema: v1

models:
  - name: Local HF via OpenAI API
    provider: openai
    model: qwen2.5-coder-3b
    apiBase: http://127.0.0.1:8000/v1

That pattern follows Continue’s documented OpenAI-compatible configuration. (Continue Docs)

Or, if you keep the Ollama-style shim:

name: Local Config
version: 1.0.0
schema: v1

models:
  - name: Local HF via Ollama Shim
    provider: ollama
    model: qwen2.5-coder:3b
    apiBase: http://127.0.0.1:11434

That pattern follows Continue’s documented Ollama configuration. (Continue Docs)

Step 4. Disable everything nonessential for the first test

Do not try to solve chat, edit, autocomplete, indexing, and agents all at once.

First get one prompt-response loop working inside Continue chat. Then add edit. Then test inline completion. Continue’s docs and config model support this incremental approach. (Continue Docs)

About OpenClaw

OpenClaw is not intentionally incompatible with local Hugging Face models. Its current docs explicitly describe two local paths:

native Ollama integration using /api/chat
OpenAI-compatible local servers such as vLLM (OpenClaw)

So the answer is not “OpenClaw rejects Hugging Face.”

But I would still not use OpenClaw as your next step. Why:

its docs are aimed at a broader agent stack, not the simplest VS Code coding-assistant setup
there are recent issues around custom local providers, baseUrl, and provider routing (GitHub)

So OpenClaw may become viable later, but it is a worse first target than Continue for your current goal. (OpenClaw)

My bottom-line judgment

For your case:

Your privacy requirement is reasonable.
Your Hugging Face local model choice is not the blocker.
Your current fake-Ollama endpoint is probably too incomplete for the extension you are testing.
Continue is the best first extension to target.
VS Code built-in Copilot/BYOK is not a good fit for strict no-login offline use.
OpenClaw is not intentionally incompatible, but it is the wrong next battle. (Continue Docs)

The cleanest practical path is:

Continue + local config + either

a more complete Ollama-style shim, or
a proper OpenAI-compatible /v1 server in front of your Hugging Face model. (Continue Docs)