Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreib3xl3r7ssf2j42bvmplyxtry3h44py3akkxdwwv75a6pdmwu5rta",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mgu77i34lab2"
  },
  "path": "/t/soumitra-dutta-oxford-how-do-i-run-inference-with-a-hugging-face-model/174214#post_2",
  "publishedAt": "2026-03-12T09:58:09.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "Hugging Face",
    "GitHub"
  ],
  "textContent": "Transformers: Reference inference behavior for Transformers format models (obviously), compatibility\nHigh-speed backends (vLLM, SGLang, etc.): Speed and stability at large scale\nGGUF file-based backends (Ollama, Llama.cpp, LM Studio, etc.): Low VRAM, low RAM\n\nWell, it depends on your use case, the hardware you have, and the model you want to use…\n\n* * *\n\n## Best approach\n\nBecause you are already using a **Hugging Face Transformers model** and want to run it **locally** , the best default is:\n\n**start with plain Transformers first** , inside a normal Python script or app.\nDo **not** switch immediately to a separate runtime or local server unless you already know you need one. Hugging Face still presents `pipeline()` as the easiest inference entry point, and for LLM-style generation it recommends `generate()` when you need more control over prompting, decoding, and memory behavior. (Hugging Face)\n\n## Why this is usually the right choice\n\nThere are a few different layers that people mix together:\n\n  * **Transformers** is the model library itself.\n  * **`transformers serve`** is a local server layer on top of it.\n  * **vLLM / SGLang** are higher-performance serving engines.\n  * **Ollama / LM Studio / llama.cpp** are local deployment runtimes, often more natural for GGUF-style workflows than for native Python Transformers development. Hugging Face’s current serving docs explicitly say `transformers serve` is suitable for evaluation, experimentation, and **moderate-load** local or self-hosted use, while **vLLM** and **SGLang** remain the recommendation for large-scale production. (Hugging Face)\n\n\n\nSo if your situation is “I have a Transformers model and want local inference,” the most sensible first move is to stay in the same stack and make that path work cleanly before adding extra layers. 
(Hugging Face)\n\n## The simple decision rule\n\n### Choose plain Transformers if:\n\nYou are running inference from **Python**, still validating the model, still tuning prompts, or still figuring out memory limits. This is the most direct route, and Hugging Face’s docs recommend the Auto classes plus `dtype=\"auto\"` and `device_map=\"auto\"` as a practical starting point for loading larger models. (Hugging Face)\n\n### Choose `transformers serve` only if:\n\nYour local Python inference already works and you now want an **OpenAI-compatible local endpoint** for another tool, UI, or app to call. HF documents it as a local server with OpenAI SDK compatibility, but still positions it as **experimental** and best for moderate load rather than maximum throughput. (Hugging Face)\n\n### Choose something else only if your real goal changes\n\nIf your goal becomes **high-throughput multi-user serving**, that is where vLLM or SGLang starts to make more sense. If your goal becomes **ultra-simple desktop/offline deployment**, that is where Ollama, LM Studio, or llama.cpp starts to make more sense. But that is a different optimization target from “run my Transformers model locally in code.” (Hugging Face)\n\n## What I would do first in practice\n\n### If it is a chat or text-generation model\n\nUse **`AutoTokenizer` + `AutoModelForCausalLM`**, format chat input with **`apply_chat_template(...)`** when the model expects chat-formatted messages, and call **`generate()`**. HF’s LLM guide explains that autoregressive generation is handled by `generate()`, and that `GenerationConfig` controls defaults such as stopping and decoding behavior. (Hugging Face)\n\n### If it is a classic NLP / vision / audio model\n\nStart with **`pipeline()`** unless you already need low-level control. HF still describes pipelines as the easiest way to run inference across many tasks such as classification, QA, ASR, and feature extraction. 
(Hugging Face)\n\n## How to make a too-large local model fit\n\nThe first tool is **`dtype=\"auto\"`**. Hugging Face documents that this initializes weights in the dtype they are stored in, which can avoid unnecessary extra memory use during loading. (Hugging Face)\n\nThe second tool is **`device_map=\"auto\"`**. Accelerate’s Big Model Inference guide says this fills GPU memory first, then CPU, then disk if necessary. That is extremely useful for getting a model to run locally even when it does not fully fit in VRAM. (Hugging Face)\n\nBut there is an important background detail: `device_map=\"auto\"` is mainly a **fit-it-into-memory** strategy, not a **fastest-possible** strategy. Accelerate’s docs say this adds inference overhead because layers are moved between devices, and in multi-GPU model parallelism only **one GPU is active at a time** while the next waits for outputs from the previous one. (Hugging Face)\n\nSo the rule is:\n\n  * use `device_map=\"auto\"` to **make a large model run**;\n  * do not expect it to be the best answer for **throughput** or **latency**. (Hugging Face)\n\n## If memory is still tight\n\nThe next thing to try is **quantization**, especially **8-bit** or **4-bit**. Hugging Face’s quantization docs say this reduces memory and compute costs and allows models that would not normally fit to run on more limited hardware; the bitsandbytes integration is the most common first step for local LLM inference. (Hugging Face)\n\nThat makes the usual progression:\n\n  1. plain model with `dtype=\"auto\"`\n  2. add `device_map=\"auto\"`\n  3. add **8-bit** quantization\n  4. if needed, try **4-bit** quantization. (Hugging Face)\n\n## One important exception: Apple Silicon\n\nIf you are on a **Mac with Apple Silicon**, plain Transformers is still a reasonable first step, but **MLX** is worth a serious look. 
Hugging Face’s MLX integration docs say MLX keeps arrays in shared memory on Apple Silicon, avoids CPU↔GPU copies, supports native safetensors loading, and can load supported Transformers language models from the Hub **without weight conversion**. (Hugging Face)\n\nSo on Apple Silicon, my recommendation becomes:\n\n  * **Transformers first** if you want the most standard HF/PyTorch path;\n  * **MLX** if local speed and Apple-native efficiency become more important. (Hugging Face)\n\n## Another exception: CPU-focused non-LLM inference\n\nIf your model is **not** a chat LLM, and you care about **CPU latency** for tasks like classification or QA, **Optimum ONNX Runtime** is worth considering. HF documents ONNX Runtime pipelines as a **drop-in replacement** for Transformers pipelines, with the same API and potential speedups on CPU and GPU. (Hugging Face)\n\nThat means:\n\n  * for local LLM work, start with standard Transformers;\n  * for local non-LLM task inference where latency matters, ONNX Runtime can be a strong second step. (Hugging Face)\n\n## The main pitfalls to avoid\n\n### 1. Jumping runtimes too early\n\nA lot of inference debugging is really about **model choice, prompt format, tokenizer behavior, or memory fit**, not the runtime itself. If you switch to a different runtime before validating those basics, you make debugging harder. HF’s docs already give you enough to validate those basics in native Transformers. (Hugging Face)\n\n### 2. Treating `device_map=\"auto\"` as a speed feature\n\nIt is primarily a **memory survival** feature. It can be the difference between “runs” and “doesn’t run,” but it is not the same thing as a tuned serving stack. (Hugging Face)\n\n### 3. Copying older examples blindly\n\nRecent Transformers releases still show ongoing **v5 cleanup** and changes in generation internals. 
The current release notes mention pipeline task updates/removals in the v5 cleanup and continued refactoring of generation input preparation away from older `cache_position` behavior. (GitHub)\n\n### 4. Assuming chat-template behavior is identical across every model\n\nThe docs show the intended chat-template path, but there have also been recent model-specific issues around `apply_chat_template(...)` return types and downstream `generate()` behavior, especially in multimodal cases. If you hit a weird template/generation mismatch, check current issues before assuming your code is fundamentally wrong. (Hugging Face)\n\n## My clear recommendation for your case\n\nIf you are currently trying to run **local inference with a Hugging Face Transformers model**, I would do this:\n\n**Best default path**\n\n  * Stay in **plain Transformers**\n  * Load with **Auto classes**\n  * Start with **`dtype=\"auto\"`**\n  * Add **`device_map=\"auto\"`** if the model is large\n  * Add **8-bit / 4-bit quantization** only if memory is still the blocker\n  * Only after the model works reliably, decide whether you need a **local server** like `transformers serve` or a different runtime. (Hugging Face)\n\n## Bottom line\n\nFor **local inference with a Hugging Face Transformers model**, the best approach is usually **not** to leave Transformers immediately.\n\nIt is:\n\n**Transformers first for correctness and fit.\nQuantization if needed for memory.\n`transformers serve` only if you want a local API.\nA different runtime only if your real goal is no longer “run this Transformers model locally in Python.”** (Hugging Face)",
  "title": "Soumitra Dutta Oxford: How do I run inference with a Hugging Face model?"
}