External Publication

Visit Post

How are you deploying HF models that don’t have inference providers?

Hugging Face Forums [Unofficial] February 13, 2026

Source

for now, popular options.

How teams deploy these models in real projects (the common pattern)

Even if a model has no attached provider, most teams still deploy it by doing:

Pull weights from the Hub (private/gated with an HF token if needed)
Serve the model with a standard inference server (LLMs: vLLM or HF TGI are common)
Run that server on a platform that matches their cost/latency needs :
- serverless / scale-to-zero for spiky traffic
- always-on for consistent low latency

Two serving stacks you’ll see repeatedly:

vLLM : widely used for LLM inference/serving and optimized throughput (GitHub)
Hugging Face TGI (Text Generation Inference) : used in production by Hugging Face to power HuggingChat/Inference API/Endpoints (GitHub)
- TGI supports a Messages API compatible with OpenAI Chat Completions , which simplifies integration and swapping providers (Hugging Face)

Serverless / scale-to-zero options (HF and non-HF)

These are the most common “serverless” ways people deploy Hub models that don’t have attached providers.

A) Hugging Face-managed

1) Inference Endpoints (autoscaling + scale-to-zero)

HF documents scale-to-zero for Endpoints to reduce cost for intermittent workloads (Hugging Face)
Expect cold starts when scaled to 0 (first request after idle wakes it).

2) Spaces (prototype / light production)

Free CPU tier: 16GB RAM, 2 CPU cores, 50GB ephemeral disk (Hugging Face)
GPU upgrades are available (paid) (Hugging Face)
Best for demos, internal tools, and light traffic; less ideal for serious SLA unless you control warm capacity.

B) “Real serverless” GPU containers (you bring a Docker image)

3) Google Cloud Run + GPUs

Official docs: GPU services can scale down to zero for cost savings (Google Cloud Documentation)
Scaling from zero is request-triggered; if you need background work you must design a “wake-up request” or set min instances > 0 (Google Cloud Documentation)
Google also publishes GPU best practices for inference (memory/KV cache/quantization tuning) (Google Cloud Documentation)

4) Azure Container Apps “serverless GPUs”

Microsoft explicitly positions this as automatic scaling + per-second billing + scale down to zero (Microsoft Learn)
Important nuance: some profiles (e.g., “Flexible”) don’t scale to zero; consumption profiles do (Azure sentence)

C) Serverless inference platforms (popular for “BYO model weights”)

5) Modal

Modal docs: functions scale to zero by default when idle (Modal)
Strong DX if your team likes “code-first” deployments.

6) RunPod Serverless

Docs: endpoints can auto-scale from zero to hundreds of workers (Runpod Documentation)
vLLM integration is a common path for serving Hub LLMs serverlessly (Runpod Documentation)

7) Baseten

Docs: set min_replica = 0 to enable scale-to-zero; first request triggers cold start, and large models can take minutes (Baseten Docs)

8) Replicate (custom model deployments)

Replicate supports deploying and scaling custom models (“deploy a custom model”) (Replicate)

Hosted model APIs (fastest integration, but you usually can’t run any arbitrary Hub repo)

If you can choose from a provider’s supported catalog (instead of “any Hub model”), teams often use these:

Together AI (serverless + also “Dedicated Endpoints”) (Together AI)
Fireworks (serverless token pricing; also on-demand deployments) (Fireworks AI)
GroqCloud (hosted inference; pricing published) (Groq)
DeepInfra (pay-as-you-use pricing; token or execution-time depending on model) (Deep Infra)

These are common in production when:

you can accept a specific model set, and
you want low ops overhead and predictable performance.

How to choose (simple decision logic)

If you must run a specific Hub repo

Pick a BYO-weights option:

HF Inference Endpoints (Hugging Face)
Cloud Run GPUs (Google Cloud Documentation)
Azure Container Apps serverless GPUs (Microsoft Learn)
Modal / RunPod / Baseten / Replicate (Modal)

If you can use “close enough” models

Use a hosted API (Together/Fireworks/Groq/DeepInfra) (Together AI)

If you’re truly serverless (scale-to-zero) and user-facing latency matters

Plan for cold start mitigation:

keep min replicas = 1 during business hours, or
do “wake-up requests,” or
pick smaller/quantized models and an engine like vLLM/TGI.

Cloud Run explicitly notes that scaling from zero requires a request, and suggests min instances or a wake-up request pattern. (Google Cloud Documentation) Baseten explicitly warns cold starts for large models can take minutes. (Baseten Docs)