
High Network Latency (500ms) When Calling vLLM Gemma-27B from India to Atlanta Server – Any Optimization Options?

Hugging Face Forums [Unofficial] February 11, 2026
Hi everyone,

I am running Gemma-3-27B-IT using vLLM serve on a GPU server located in Atlanta (US). My request backend is located in India, and I’m sending inference requests over the public internet.

Observations:

* Model inference time: ~200 ms
* Network latency (round trip): ~500 ms
* Total response time: ~700 ms
* Using HTTP API (not WebSocket)
* Standard vLLM serve command with chunked prefill + fp8 quantization

The 500 ms seems to be purely network latency between India and Atlanta.

Questions:

1. Is this latency expected for India ↔ US East traffic?
2. Would switching to WebSockets meaningfully reduce latency?
3. Would placing FastAPI in the same VPC/region as vLLM reduce overall delay significantly?
4. Has anyone optimized cross-continent LLM inference setups successfully?
5. Are there networking tricks (persistent connections, HTTP/2, Anycast, CDN, etc.) that help in this scenario?

Goal: I’m targeting near-real-time responses (<300 ms total), so I’m evaluating whether architecture changes are required.

Any insights or real-world experiences would be very helpful. Thanks!
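To make the numbers above easier to reason about, here is a rough latency-budget sketch. It assumes a simple model in which a fresh connection costs one round trip for the TCP handshake, optionally more for TLS, plus one round trip for the request and response. The 250 ms round-trip time is a hypothetical value chosen only because it reproduces the ~500 ms network figure under that model; it is not a measurement, and the function names are made up for illustration.

```python
def estimate_total_ms(rtt_ms, inference_ms, reuse_connection, tls_round_trips=0):
    """Rough per-request latency under a simple round-trip model.

    Assumption: a new connection pays 1 RTT for the TCP handshake plus
    `tls_round_trips` RTTs for TLS; every request pays 1 RTT for the
    request/response exchange itself.
    """
    handshake_round_trips = 0 if reuse_connection else 1 + tls_round_trips
    return (handshake_round_trips + 1) * rtt_ms + inference_ms


def max_rtt_for_target_ms(target_ms, inference_ms):
    """Largest RTT that still meets the target on an already-open connection."""
    return target_ms - inference_ms


if __name__ == "__main__":
    RTT_MS = 250.0        # hypothetical India <-> Atlanta round trip, not measured
    INFERENCE_MS = 200.0  # from the post

    # New plain-HTTP connection per request: handshake + request round trips.
    print("new connection:", estimate_total_ms(RTT_MS, INFERENCE_MS, reuse_connection=False))
    # Persistent (keep-alive) connection: only the request round trip remains.
    print("reused connection:", estimate_total_ms(RTT_MS, INFERENCE_MS, reuse_connection=True))
    # RTT budget left for a 300 ms total with 200 ms inference.
    print("max RTT for target:", max_rtt_for_target_ms(300.0, INFERENCE_MS))
```

Under these assumptions, connection reuse (for example a `requests.Session` in Python) removes the handshake round trips but still leaves one full round trip per request, so the <300 ms goal would only leave about 100 ms for the network. Whether that is reachable depends on the real round-trip time, which is worth measuring directly (for instance with `ping` or a timed request on a warm connection) before changing the architecture.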
