{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreib2xp3ttrfnwry4xtz5v7jq4gfakg7bbkek2qopyc6yq34xi6haky",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3melrhpidqno2"
},
"path": "/t/high-network-latency-500ms-when-calling-vllm-gemma-27b-from-india-to-atlanta-server-any-optimization-options/173352#post_1",
"publishedAt": "2026-02-11T13:32:21.000Z",
"site": "https://discuss.huggingface.co",
"textContent": "Hi everyone,\n\nI am running Gemma-3-27B-IT using vLLM serve on a GPU server located in Atlanta (US).\n\nMy request backend is located in India, and I’m sending inference requests over the public internet.\n\nObservations:\n\n * Model inference time: ~200 ms\n * Network latency (round trip): ~500 ms\n * Total response time: ~700 ms\n * Using HTTP API (not WebSocket)\n * Standard vLLM serve command with chunked prefill + fp8 quantization\n\nThe 500 ms seems to be purely network latency between India and Atlanta.\n\nQuestions:\n\n 1. Is this latency expected for India ↔ US East traffic?\n 2. Would switching to WebSockets meaningfully reduce latency?\n 3. Would placing FastAPI in the same VPC/region as vLLM reduce overall delay significantly?\n 4. Has anyone optimized cross-continent LLM inference setups successfully?\n 5. Are there networking tricks (persistent connections, HTTP/2, Anycast, CDN, etc.) that help in this scenario?\n\nGoal:\nI’m targeting near-real-time responses (<300 ms total), so I’m evaluating whether architecture changes are required.\n\nAny insights or real-world experiences would be very helpful.\n\nThanks!",
"title": "High Network Latency (500ms) When Calling vLLM Gemma-27B from India to Atlanta Server – Any Optimization Options?"
}