How are you deploying HF models that don’t have inference providers?
for now, popular options.
How teams deploy these models in real projects (the common pattern)
Even if a model has no attached provider, most teams still deploy it by doing:
Pull weights from the Hub (private/gated with an HF token if needed)
Serve the model with a standard inference server (LLMs: vLLM or HF TGI are common)
Run that server on a platform that matches their cost/latency needs :
- serverless / scale-to-zero for spiky traffic
- always-on for consistent low latency
Two serving stacks you’ll see repeatedly:
vLLM : widely used for LLM inference/serving and optimized throughput (GitHub)
Hugging Face TGI (Text Generation Inference) : used in production by Hugging Face to power HuggingChat/Inference API/Endpoints (GitHub)
- TGI supports a Messages API compatible with OpenAI Chat Completions , which simplifies integration and swapping providers (Hugging Face)
Serverless / scale-to-zero options (HF and non-HF)
These are the most common “serverless” ways people deploy Hub models that don’t have attached providers.
A) Hugging Face-managed
1) Inference Endpoints (autoscaling + scale-to-zero)
- HF documents scale-to-zero for Endpoints to reduce cost for intermittent workloads (Hugging Face)
- Expect cold starts when scaled to 0 (first request after idle wakes it).
2) Spaces (prototype / light production)
- Free CPU tier: 16GB RAM, 2 CPU cores, 50GB ephemeral disk (Hugging Face)
- GPU upgrades are available (paid) (Hugging Face)
- Best for demos, internal tools, and light traffic; less ideal for serious SLA unless you control warm capacity.
B) “Real serverless” GPU containers (you bring a Docker image)
3) Google Cloud Run + GPUs
- Official docs: GPU services can scale down to zero for cost savings (Google Cloud Documentation)
- Scaling from zero is request-triggered; if you need background work you must design a “wake-up request” or set min instances > 0 (Google Cloud Documentation)
- Google also publishes GPU best practices for inference (memory/KV cache/quantization tuning) (Google Cloud Documentation)
4) Azure Container Apps “serverless GPUs”
- Microsoft explicitly positions this as automatic scaling + per-second billing + scale down to zero (Microsoft Learn)
- Important nuance: some profiles (e.g., “Flexible”) don’t scale to zero; consumption profiles do (Azure sentence)
C) Serverless inference platforms (popular for “BYO model weights”)
5) Modal
- Modal docs: functions scale to zero by default when idle (Modal)
- Strong DX if your team likes “code-first” deployments.
6) RunPod Serverless
- Docs: endpoints can auto-scale from zero to hundreds of workers (Runpod Documentation)
- vLLM integration is a common path for serving Hub LLMs serverlessly (Runpod Documentation)
7) Baseten
- Docs: set
min_replica = 0to enable scale-to-zero; first request triggers cold start, and large models can take minutes (Baseten Docs)
8) Replicate (custom model deployments)
- Replicate supports deploying and scaling custom models (“deploy a custom model”) (Replicate)
Hosted model APIs (fastest integration, but you usually can’t run any arbitrary Hub repo)
If you can choose from a provider’s supported catalog (instead of “any Hub model”), teams often use these:
- Together AI (serverless + also “Dedicated Endpoints”) (Together AI)
- Fireworks (serverless token pricing; also on-demand deployments) (Fireworks AI)
- GroqCloud (hosted inference; pricing published) (Groq)
- DeepInfra (pay-as-you-use pricing; token or execution-time depending on model) (Deep Infra)
These are common in production when:
- you can accept a specific model set, and
- you want low ops overhead and predictable performance.
How to choose (simple decision logic)
If you must run a specific Hub repo
Pick a BYO-weights option:
- HF Inference Endpoints (Hugging Face)
- Cloud Run GPUs (Google Cloud Documentation)
- Azure Container Apps serverless GPUs (Microsoft Learn)
- Modal / RunPod / Baseten / Replicate (Modal)
If you can use “close enough” models
Use a hosted API (Together/Fireworks/Groq/DeepInfra) (Together AI)
If you’re truly serverless (scale-to-zero) and user-facing latency matters
Plan for cold start mitigation:
- keep min replicas = 1 during business hours, or
- do “wake-up requests,” or
- pick smaller/quantized models and an engine like vLLM/TGI.
Cloud Run explicitly notes that scaling from zero requires a request, and suggests min instances or a wake-up request pattern. (Google Cloud Documentation) Baseten explicitly warns cold starts for large models can take minutes. (Baseten Docs)
Discussion in the ATmosphere