{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicounvguee4bdfxk6ghvmhixvqw2w5rzkddhazcc3gazwindx5nwi",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3meyeevjha7k2"
  },
  "path": "/t/how-are-you-deploying-hf-models-that-don-t-have-inference-providers/172964#post_5",
  "publishedAt": "2026-02-16T15:08:40.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "One pattern we’ve seen is that “serverless” often works well for experimentation, but once traffic stabilizes, teams start optimizing for predictability rather than pure scale-to-zero behavior.\n\nCold starts and GPU spin-up time can become a bigger operational cost than the compute itself, especially for user-facing workloads.\n\nA lot of deployments end up hybrid: serverless for spiky jobs and warm capacity for anything latency-sensitive.",
  "title": "How are you deploying HF models that don’t have inference providers?"
}