Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreib2xp3ttrfnwry4xtz5v7jq4gfakg7bbkek2qopyc6yq34xi6haky",
    "uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3melrhpidqno2"
  },
  "path": "/t/high-network-latency-500ms-when-calling-vllm-gemma-27b-from-india-to-atlanta-server-any-optimization-options/173352#post_1",
  "publishedAt": "2026-02-11T13:32:21.000Z",
  "site": "https://discuss.huggingface.co",
  "textContent": "Hi everyone,\n\nI am running Gemma-3-27B-IT using vLLM serve on a GPU server located in Atlanta (US).\n\nMy request backend is located in India, and I’m sending inference requests over the public internet.\n\nObservations:\n\n  * Model inference time: ~200 ms\n  * Network latency (round trip): ~500 ms\n  * Total response time: ~700 ms\n  * Using HTTP API (not WebSocket)\n  * Standard vLLM serve command with chunked prefill + fp8 quantization\n\n\n\nThe 500 ms seems to be purely network latency between India and Atlanta.\n\nQuestions:\n\n  1. Is this latency expected for India ↔ US East traffic?\n  2. Would switching to WebSockets meaningfully reduce latency?\n  3. Would placing FastAPI in the same VPC/region as vLLM reduce overall delay significantly?\n  4. Has anyone optimized cross-continent LLM inference setups successfully?\n  5. Are there networking tricks (persistent connections, HTTP/2, Anycast, CDN, etc.) that help in this scenario?\n\n\n\nGoal:\nI’m targeting near-real-time responses (<300 ms total), so I’m evaluating whether architecture changes are required.\n\nAny insights or real-world experiences would be very helpful.\n\nThanks!",
  "title": "High Network Latency (500ms) When Calling vLLM Gemma-27B from India to Atlanta Server – Any Optimization Options?"
}