External Publication
Visit Post

How are you deploying HF models that don’t have inference providers?

Hugging Face Forums [Unofficial] February 13, 2026
Source

for now, popular options.


How teams deploy these models in real projects (the common pattern)

Even if a model has no attached provider, most teams still deploy it by doing:

  1. Pull weights from the Hub (private/gated with an HF token if needed)

  2. Serve the model with a standard inference server (LLMs: vLLM or HF TGI are common)

  3. Run that server on a platform that matches their cost/latency needs :

    • serverless / scale-to-zero for spiky traffic
    • always-on for consistent low latency

Two serving stacks you’ll see repeatedly:

  • vLLM : widely used for LLM inference/serving and optimized throughput (GitHub)

  • Hugging Face TGI (Text Generation Inference) : used in production by Hugging Face to power HuggingChat/Inference API/Endpoints (GitHub)

    • TGI supports a Messages API compatible with OpenAI Chat Completions , which simplifies integration and swapping providers (Hugging Face)

Serverless / scale-to-zero options (HF and non-HF)

These are the most common “serverless” ways people deploy Hub models that don’t have attached providers.

A) Hugging Face-managed

1) Inference Endpoints (autoscaling + scale-to-zero)

  • HF documents scale-to-zero for Endpoints to reduce cost for intermittent workloads (Hugging Face)
  • Expect cold starts when scaled to 0 (first request after idle wakes it).

2) Spaces (prototype / light production)

  • Free CPU tier: 16GB RAM, 2 CPU cores, 50GB ephemeral disk (Hugging Face)
  • GPU upgrades are available (paid) (Hugging Face)
  • Best for demos, internal tools, and light traffic; less ideal for serious SLA unless you control warm capacity.

B) “Real serverless” GPU containers (you bring a Docker image)

3) Google Cloud Run + GPUs

  • Official docs: GPU services can scale down to zero for cost savings (Google Cloud Documentation)
  • Scaling from zero is request-triggered; if you need background work you must design a “wake-up request” or set min instances > 0 (Google Cloud Documentation)
  • Google also publishes GPU best practices for inference (memory/KV cache/quantization tuning) (Google Cloud Documentation)

4) Azure Container Apps “serverless GPUs”

  • Microsoft explicitly positions this as automatic scaling + per-second billing + scale down to zero (Microsoft Learn)
  • Important nuance: some profiles (e.g., “Flexible”) don’t scale to zero; consumption profiles do (Azure sentence)

C) Serverless inference platforms (popular for “BYO model weights”)

5) Modal

  • Modal docs: functions scale to zero by default when idle (Modal)
  • Strong DX if your team likes “code-first” deployments.

6) RunPod Serverless

  • Docs: endpoints can auto-scale from zero to hundreds of workers (Runpod Documentation)
  • vLLM integration is a common path for serving Hub LLMs serverlessly (Runpod Documentation)

7) Baseten

  • Docs: set min_replica = 0 to enable scale-to-zero; first request triggers cold start, and large models can take minutes (Baseten Docs)

8) Replicate (custom model deployments)

  • Replicate supports deploying and scaling custom models (“deploy a custom model”) (Replicate)

Hosted model APIs (fastest integration, but you usually can’t run any arbitrary Hub repo)

If you can choose from a provider’s supported catalog (instead of “any Hub model”), teams often use these:

  • Together AI (serverless + also “Dedicated Endpoints”) (Together AI)
  • Fireworks (serverless token pricing; also on-demand deployments) (Fireworks AI)
  • GroqCloud (hosted inference; pricing published) (Groq)
  • DeepInfra (pay-as-you-use pricing; token or execution-time depending on model) (Deep Infra)

These are common in production when:

  • you can accept a specific model set, and
  • you want low ops overhead and predictable performance.

How to choose (simple decision logic)

If you must run a specific Hub repo

Pick a BYO-weights option:

  • HF Inference Endpoints (Hugging Face)
  • Cloud Run GPUs (Google Cloud Documentation)
  • Azure Container Apps serverless GPUs (Microsoft Learn)
  • Modal / RunPod / Baseten / Replicate (Modal)

If you can use “close enough” models

Use a hosted API (Together/Fireworks/Groq/DeepInfra) (Together AI)

If you’re truly serverless (scale-to-zero) and user-facing latency matters

Plan for cold start mitigation:

  • keep min replicas = 1 during business hours, or
  • do “wake-up requests,” or
  • pick smaller/quantized models and an engine like vLLM/TGI.

Cloud Run explicitly notes that scaling from zero requires a request, and suggests min instances or a wake-up request pattern. (Google Cloud Documentation) Baseten explicitly warns cold starts for large models can take minutes. (Baseten Docs)

Discussion in the ATmosphere

Loading comments...