{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreid37gultdfhyidhhfss7pj22z25rjfzzxtpx33ibsjhpysi653at4",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3meqf4d7k72e2"
  },
  "path": "/t/how-are-you-deploying-hf-models-that-don-t-have-inference-providers/172964#post_4",
  "publishedAt": "2026-02-13T10:45:39.000Z",
  "site": "https://discuss.huggingface.co",
  "tags": [
    "GitHub",
    "GitHub",
    "Hugging Face",
    "Hugging Face",
    "Hugging Face",
    "Hugging Face",
    "Google Cloud Documentation",
    "Google Cloud Documentation",
    "Google Cloud Documentation",
    "Microsoft Learn",
    "Azure sentence",
    "Modal",
    "Runpod Documentation",
    "Runpod Documentation",
    "Baseten Docs",
    "Replicate",
    "Together AI",
    "Fireworks AI",
    "Groq",
    "Deep Infra",
    "Hugging Face",
    "Google Cloud Documentation",
    "Microsoft Learn",
    "Modal",
    "Together AI",
    "Google Cloud Documentation",
    "Baseten Docs"
  ],
  "textContent": "for now, popular options.\n\n* * *\n\n## How teams deploy these models in real projects (the common pattern)\n\nEven if a model has no attached provider, most teams still deploy it by doing:\n\n  1. **Pull weights from the Hub** (private/gated with an HF token if needed)\n\n  2. **Serve the model with a standard inference server** (LLMs: vLLM or HF TGI are common)\n\n  3. **Run that server on a platform that matches their cost/latency needs** :\n\n     * serverless / scale-to-zero for spiky traffic\n     * always-on for consistent low latency\n\n\n\nTwo serving stacks you’ll see repeatedly:\n\n  * **vLLM** : widely used for LLM inference/serving and optimized throughput (GitHub)\n\n  * **Hugging Face TGI (Text Generation Inference)** : used in production by Hugging Face to power HuggingChat/Inference API/Endpoints (GitHub)\n\n    * TGI supports a **Messages API compatible with OpenAI Chat Completions** , which simplifies integration and swapping providers (Hugging Face)\n\n\n\n* * *\n\n## Serverless / scale-to-zero options (HF and non-HF)\n\nThese are the most common “serverless” ways people deploy Hub models that don’t have attached providers.\n\n### A) Hugging Face-managed\n\n**1) Inference Endpoints (autoscaling + scale-to-zero)**\n\n  * HF documents scale-to-zero for Endpoints to reduce cost for intermittent workloads (Hugging Face)\n  * Expect cold starts when scaled to 0 (first request after idle wakes it).\n\n\n\n**2) Spaces (prototype / light production)**\n\n  * Free CPU tier: **16GB RAM, 2 CPU cores, 50GB ephemeral disk** (Hugging Face)\n  * GPU upgrades are available (paid) (Hugging Face)\n  * Best for demos, internal tools, and light traffic; less ideal for serious SLA unless you control warm capacity.\n\n\n\n### B) “Real serverless” GPU containers (you bring a Docker image)\n\n**3) Google Cloud Run + GPUs**\n\n  * Official docs: GPU services **can scale down to zero** for cost savings (Google Cloud Documentation)\n  * Scaling from 
zero is request-triggered; if you need background work you must design a “wake-up request” or set min instances > 0 (Google Cloud Documentation)\n  * Google also publishes GPU best practices for inference (memory/KV cache/quantization tuning) (Google Cloud Documentation)\n\n\n\n**4) Azure Container Apps “serverless GPUs”**\n\n  * Microsoft explicitly positions this as **automatic scaling + per-second billing + scale down to zero** (Microsoft Learn)\n  * Important nuance: some profiles (e.g., “Flexible”) don’t scale to zero; consumption profiles do (Azure sentence)\n\n\n\n### C) Serverless inference platforms (popular for “BYO model weights”)\n\n**5) Modal**\n\n  * Modal docs: functions **scale to zero by default** when idle (Modal)\n  * Strong DX if your team likes “code-first” deployments.\n\n\n\n**6) RunPod Serverless**\n\n  * Docs: endpoints can **auto-scale from zero to hundreds of workers** (Runpod Documentation)\n  * vLLM integration is a common path for serving Hub LLMs serverlessly (Runpod Documentation)\n\n\n\n**7) Baseten**\n\n  * Docs: set `min_replica = 0` to enable scale-to-zero; first request triggers cold start, and large models can take minutes (Baseten Docs)\n\n\n\n**8) Replicate (custom model deployments)**\n\n  * Replicate supports deploying and scaling custom models (“deploy a custom model”) (Replicate)\n\n\n\n* * *\n\n## Hosted model APIs (fastest integration, but you usually can’t run any arbitrary Hub repo)\n\nIf you can choose from a provider’s supported catalog (instead of “any Hub model”), teams often use these:\n\n  * **Together AI** (serverless + also “Dedicated Endpoints”) (Together AI)\n  * **Fireworks** (serverless token pricing; also on-demand deployments) (Fireworks AI)\n  * **GroqCloud** (hosted inference; pricing published) (Groq)\n  * **DeepInfra** (pay-as-you-use pricing; token or execution-time depending on model) (Deep Infra)\n\n\n\nThese are common in production when:\n\n  * you can accept a specific model set, and\n  * you 
want low ops overhead and predictable performance.\n\n\n\n* * *\n\n## How to choose (simple decision logic)\n\n### If you must run **a specific Hub repo**\n\nPick a **BYO-weights** option:\n\n  * HF Inference Endpoints (Hugging Face)\n  * Cloud Run GPUs (Google Cloud Documentation)\n  * Azure Container Apps serverless GPUs (Microsoft Learn)\n  * Modal / RunPod / Baseten / Replicate (Modal)\n\n\n\n### If you can use “close enough” models\n\nUse a hosted API (Together/Fireworks/Groq/DeepInfra) (Together AI)\n\n### If you’re running truly serverless (scale-to-zero) and user-facing latency matters\n\nPlan for cold start mitigation:\n\n  * keep **min replicas = 1** during business hours, or\n  * send “wake-up requests,” or\n  * pick smaller/quantized models and an engine like vLLM/TGI.\n\n\n\nCloud Run explicitly notes that scaling from zero requires a request, and suggests min instances or a wake-up request pattern. (Google Cloud Documentation)\nBaseten explicitly warns that cold starts for large models can take minutes. (Baseten Docs)",
  "title": "How are you deploying HF models that don’t have inference providers?"
}