Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreibhfuq5br7iy3o4il6wuppr6f7tf37yw7zoq3jewvrlpo4tri2y5q",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mi2oesrg3342"
  },
  "path": "/t/using-a-hugging-face-model-offline-to-support-code-generation-in-vscode/174627#post_4",
  "publishedAt": "2026-03-27T17:31:13.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "docs.continue.dev",
    "How to Run Continue Without Internet | Continue Docs",
    "@hellencharless54",
    "@app.post"
  ],
  "textContent": "Thank you for the detailed AI responses.\n\nFor clarity, I was not trying different pluggins at once, I tried Continue, LM Studio, CodeGPT and AI Tools in VSCode, separately, and deleted each before trying the next.\n\nThe Continue docs for running offline, without internet, the guide here:\n\ndocs.continue.dev\n\n### How to Run Continue Without Internet | Continue Docs\n\nLearn how to set up Continue for air-gapped or offline environments using local models, including steps to disable telemetry and configure local model providers\n\nprovides only one link, in point 3, that points to model-providers/ollama, and that page is blank. There is no specific configuration given there.\n\nMy understanding is Ollama and Hugging Face provide there own ecosystems, interface code, to load and set up models, pass prompts to the tokenizer, and then read the response. The includes I use in Python, the functions I call, seem to be unique to the Hugging Face ecosystem. While the model downloaded may (or may not) be the same when run through the Ollama and Hugging Face ecosystems, the front end is different. I prefer the Hugging Face ecosystem which I have tested and proved in my tests to not make any connection to the internet after the model is loaded onto the local disk by a different program, and provided the environment variables:\n\nos.environ[‘TRANSFORMERS_OFFLINE’] = ‘1’\nos.environ[‘HF_DATASETS_OFFLINE’] = ‘1’\n\nare defined near the beginning of the code. I use the linux “strace” command to verify this:\n\nstrace -f -e connect -s 10000 -o trace.log python3 MyCode.py\n\nNote the -f option follows all processes that are spawned by the code. Zero connections made. Ollama cannot do better than zero, so I prefer to stick with what I have tested.\n\n@hellencharless54 my research indicates, to create a personal extension to VSCode, written in Python, I must write a typescript (or javascript) wrapper to connect VSCode to the Python script. 
To do this I have to use npm, generator-code, and Yeoman (yo). I have no previous experience with TypeScript, JavaScript, npm, Yeoman, or anything like that. I will see what I can get my LLM to write for me, but right now the details, requirements, and scope of writing a VSCode extension are fuzzy to me.\n\nI think switching my uvicorn server to an OpenAI-compatible API format is probably a better idea.\n\nI got the following code as a starting point:\n\n\n    import uvicorn\n    from fastapi import FastAPI, HTTPException\n    from pydantic import BaseModel\n    from transformers import AutoModelForCausalLM, AutoTokenizer\n    import torch\n    from typing import List, Optional\n\n    # 1. Load Model and Tokenizer\n    MODEL_ID = \"gpt2\" # Replace with your local model path or HF hub ID\n    print(f\"Loading model: {MODEL_ID}...\")\n    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)\n    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)\n\n    # Move to GPU if available\n    device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n    model.to(device)\n\n    app = FastAPI(title=\"Local HuggingFace OpenAI-Compatible API\")\n\n    # 2. Define OpenAI-compatible Schemas\n    class ChatMessage(BaseModel):\n        role: str\n        content: str\n\n    class ChatCompletionRequest(BaseModel):\n        model: str\n        messages: List[ChatMessage]\n        temperature: Optional[float] = 0.7\n        max_tokens: Optional[int] = 50\n\n    class ChatCompletionResponse(BaseModel):\n        choices: list\n\n    # 3. 
API Endpoint\n    @app.post(\"/v1/chat/completions\", response_model=ChatCompletionResponse)\n    async def chat_completions(request: ChatCompletionRequest):\n        # Convert chat messages to a single prompt\n        prompt = \"\\n\".join([msg.content for msg in request.messages])\n\n        inputs = tokenizer(prompt, return_tensors=\"pt\").to(device)\n\n        # Generate response\n        with torch.no_grad():\n            output_ids = model.generate(\n                **inputs,\n                max_new_tokens=request.max_tokens,\n                temperature=request.temperature,\n                do_sample=True,\n                pad_token_id=tokenizer.eos_token_id\n            )\n\n        response_text = tokenizer.decode(output_ids[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)\n\n        # Format as OpenAI response\n        return {\n            \"choices\": [{\n                \"message\": {\"role\": \"assistant\", \"content\": response_text},\n                \"finish_reason\": \"stop\",\n                \"index\": 0\n            }]\n        }\n\n    if __name__ == \"__main__\":\n        uvicorn.run(app, host=\"0.0.0.0\", port=8000)\n",
  "title": "Using a Hugging Face Model offline to support code generation in VSCode"
}