Using a Hugging Face Model offline to support code generation in VSCode

Hugging Face Forums [Unofficial] March 27, 2026

Thank you for the detailed AI responses.

For clarity, I was not trying different plugins at once. I tried Continue, LM Studio, CodeGPT and AI Tools in VSCode separately, and deleted each before trying the next.

The Continue guide for running offline, without internet, linked here:

docs.continue.dev

How to Run Continue Without Internet | Continue Docs

Learn how to set up Continue for air-gapped or offline environments using local models, including steps to disable telemetry and configure local model providers

provides only one link, in point 3, that points to model-providers/ollama, and that page is blank. There is no specific configuration given there.

My understanding is that Ollama and Hugging Face each provide their own ecosystem: interface code to load and set up models, pass prompts to the tokenizer, and then read the response. The imports I use in Python and the functions I call seem to be unique to the Hugging Face ecosystem. While the downloaded model may (or may not) be the same when run through the Ollama and Hugging Face ecosystems, the front end is different. I prefer the Hugging Face ecosystem, which my tests have shown makes no connection to the internet once the model has been loaded onto the local disk by a different program, provided the environment variables:

os.environ['TRANSFORMERS_OFFLINE'] = '1'
os.environ['HF_DATASETS_OFFLINE'] = '1'

are defined near the beginning of the code. I use the Linux “strace” command to verify this:

strace -f -e connect -s 10000 -o trace.log python3 MyCode.py

Note that the -f option follows all processes spawned by the code. The trace shows zero connections made. Ollama cannot do better than zero, so I prefer to stick with what I have tested.
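For larger traces, reading trace.log by eye gets tedious. The following is a minimal sketch of a checker that counts connect() calls per address family, assuming the log uses strace's usual `sa_family=AF_...` notation. The function names are illustrative, not part of any tool. Local AF_UNIX sockets, if any appear, are not internet traffic, so only AF_INET and AF_INET6 are treated as network connections here.

```python
import re
from collections import Counter

# Matches the address family of a connect() call as strace usually prints it, e.g.
# 1234 connect(3, {sa_family=AF_INET, sin_port=htons(443), ...}, 16) = 0
CONNECT_RE = re.compile(r"\bconnect\(\d+[^,]*, \{sa_family=(AF_\w+)")

# Families that imply IP traffic; AF_UNIX and others stay on the machine.
NETWORK_FAMILIES = {"AF_INET", "AF_INET6"}


def summarize_connects(lines):
    """Count connect() calls per address family in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = CONNECT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def network_connects(counts):
    """Number of connect() calls that used an IP address family."""
    return sum(n for family, n in counts.items() if family in NETWORK_FAMILIES)


def summarize_trace_file(path):
    """Read a strace output file and return (per-family counts, network total)."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        counts = summarize_connects(fh)
    return counts, network_connects(counts)
```

Calling `summarize_trace_file("trace.log")` after the strace run returns the per-family counts and the number of IP connections; a network total of zero matches the result described above.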

@hellencharless54 my research indicates that, to create a personal VSCode extension written in Python, I must write a TypeScript (or JavaScript) wrapper to connect VSCode to the Python script. To do this I have to use npm, generator-code and Yeoman (yo). I have no previous experience with TypeScript, JavaScript, npm, Yeoman, or anything like that. I will see what I can get my LLM to write for me, but right now the details, requirements and scope of writing a VSCode extension are fuzzy to me.

I am thinking that changing my uvicorn server to an OpenAI-compatible API format is probably a better idea.
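To make the target concrete, here is a rough sketch of the JSON shapes that OpenAI-compatible clients generally exchange on `/v1/chat/completions` for a non-streaming request. The helper names `build_request` and `build_response` are illustrative only, and which fields a given VSCode plugin actually requires is an assumption to check against that plugin.

```python
import time
import uuid


def build_request(model, user_text, max_tokens=50, temperature=0.7):
    """Body a client would POST to /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def build_response(model, text, prompt_tokens=0, completion_tokens=0):
    """Body a server would send back for a non-streaming request."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",  # any unique string
        "object": "chat.completion",
        "created": int(time.time()),  # Unix timestamp in seconds
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```

Some clients also ask for streamed responses (`"stream": true`), which this sketch does not cover.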

I got the following code as a starting point:

```python
import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from typing import List, Optional

# 1. Load Model and Tokenizer
MODEL_ID = "gpt2" # Replace with your local model path or HF hub ID
print(f"Loading model: {MODEL_ID}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()  # inference only

app = FastAPI(title="Local HuggingFace OpenAI-Compatible API")

# 2. Define OpenAI-compatible Schemas
class ChatMessage(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 50

class ChatCompletionResponse(BaseModel):
    choices: list

# 3. API Endpoint
@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
async def chat_completions(request: ChatCompletionRequest):
    if not request.messages:
        raise HTTPException(status_code=400, detail="messages must not be empty")

    # Convert chat messages to a single prompt (roles are dropped here)
    prompt = "\n".join([msg.content for msg in request.messages])

    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # Sample only for a positive temperature; otherwise use greedy decoding,
    # because sampling with temperature=0 raises an error in generate()
    do_sample = request.temperature is not None and request.temperature > 0
    gen_kwargs = {
        "max_new_tokens": request.max_tokens,
        "do_sample": do_sample,
        "pad_token_id": tokenizer.eos_token_id,
    }
    if do_sample:
        gen_kwargs["temperature"] = request.temperature

    # Generate response
    with torch.no_grad():
        output_ids = model.generate(**inputs, **gen_kwargs)

    # Decode only the newly generated tokens, not the prompt
    response_text = tokenizer.decode(output_ids[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)

    # Format as OpenAI response
    return {
        "choices": [{
            "message": {"role": "assistant", "content": response_text},
            "finish_reason": "stop",
            "index": 0
        }]
    }

if __name__ == "__main__":
    # 0.0.0.0 listens on all interfaces; use 127.0.0.1 to accept only local connections
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
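The starting-point code loads `MODEL_ID` without the offline settings described earlier in this post. Below is a minimal sketch of combining the two, assuming the model was already saved to a local directory. The helper names are illustrative, and `HF_HUB_OFFLINE` and `local_files_only=True` are extra safeguards beyond the two variables tested above, so treat them as assumptions to verify with the same strace check. The import is deferred on the assumption that the libraries read these flags when they are first imported.

```python
import os

OFFLINE_ENV = {
    "TRANSFORMERS_OFFLINE": "1",
    "HF_DATASETS_OFFLINE": "1",
    "HF_HUB_OFFLINE": "1",  # assumption: extra safeguard, not in the original post
}


def enable_offline_mode():
    """Set the offline flags; call this before importing transformers."""
    os.environ.update(OFFLINE_ENV)


def load_local_model(model_dir):
    """Load tokenizer and model from a local directory only."""
    enable_offline_mode()
    # Imported here so the flags are already set when the library loads.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(model_dir, local_files_only=True)
    return tokenizer, model
```

In the server, a call such as `tokenizer, model = load_local_model("./models/my-model")` (a placeholder path) would stand in for the two `from_pretrained` calls.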