High Network Latency (500 ms) When Calling vLLM Gemma-27B from India to Atlanta Server – Any Optimization Options?
Hugging Face Forums [Unofficial]
February 11, 2026
Hi everyone,
I am running Gemma-3-27B-IT using vLLM serve on a GPU server located in Atlanta (US).
My request backend is located in India, and I’m sending inference requests over the public internet.
Observations:
* Model inference time: ~200 ms
* Network latency (round trip): ~500 ms
* Total response time: ~700 ms
* Using HTTP API (not WebSocket)
* Standard vllm serve command with chunked prefill and FP8 quantization
The 500 ms seems to be purely network latency between India and Atlanta.
Questions:
1. Is this latency expected for India ↔ US East traffic?
2. Would switching to WebSockets meaningfully reduce latency?
3. Would placing FastAPI in the same VPC/region as vLLM reduce overall delay significantly?
4. Has anyone optimized cross-continent LLM inference setups successfully?
5. Are there networking tricks (persistent connections, HTTP/2, Anycast, CDN, etc.) that help in this scenario?
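For the persistent-connection part of question 5, a minimal sketch of what connection reuse looks like from the client side. This is not the setup from the post: it starts a local stand-in server (the hypothetical `EchoHandler`) instead of a real vLLM endpoint, and the request path is illustrative only. It counts how many TCP connections the server sees when the client reuses one connection versus opening a new one per request.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Local stand-in for an inference endpoint (hypothetical, not vLLM)."""

    protocol_version = "HTTP/1.1"  # HTTP/1.1 keeps the connection open by default
    seen_ports = set()  # one client source port per TCP connection

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # drain the request body
        EchoHandler.seen_ports.add(self.client_address[1])
        body = b'{"ok": true}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


def count_connections(num_requests, reuse):
    """Send num_requests POSTs and return how many TCP connections were used."""
    EchoHandler.seen_ports = set()
    server = ThreadingHTTPServer(("127.0.0.1", 0), EchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        for _ in range(num_requests):
            if not reuse:
                # Simulate a client that reconnects for every request.
                conn.close()
                conn = http.client.HTTPConnection(host, port, timeout=5)
            conn.request(
                "POST",
                "/v1/chat/completions",  # illustrative path only
                body=b"{}",
                headers={"Content-Type": "application/json"},
            )
            conn.getresponse().read()  # read fully so the socket can be reused
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
    return len(EchoHandler.seen_ports)


if __name__ == "__main__":
    print("connections with reuse:", count_connections(3, reuse=True))
    print("connections without reuse:", count_connections(3, reuse=False))
```

Each new TCP connection costs at least one extra round trip before the request can be sent, plus more for TLS, so on a long-haul link the reused connection should save that setup time on every call after the first. It cannot reduce the round trip of the request itself.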
Goal:
I’m targeting near-real-time responses (<300 ms total), so I’m evaluating whether architecture changes are required.
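As a rough sanity check on these numbers, the sketch below models one request as connection-setup round trips plus one request/response round trip plus inference time. The 250 ms RTT is an assumed value for illustration, not a measurement from the post, and the function names are hypothetical. The model ignores bandwidth, server queueing, and streaming.

```python
def request_latency_ms(rtt_ms, inference_ms, reuse_connection, tls_round_trips=0):
    """Estimate wall-clock time for one request.

    A new TCP connection costs one round trip for the handshake, plus
    tls_round_trips for TLS (1 for TLS 1.3, 2 for TLS 1.2, 0 for plain HTTP).
    A reused connection pays only the request/response round trip.
    """
    setup_round_trips = 0 if reuse_connection else 1 + tls_round_trips
    return (setup_round_trips + 1) * rtt_ms + inference_ms


def max_rtt_for_target_ms(target_ms, inference_ms, round_trips=1):
    """Largest RTT that still meets target_ms for the given number of round trips."""
    return (target_ms - inference_ms) / round_trips


if __name__ == "__main__":
    ASSUMED_RTT_MS = 250  # assumption for illustration, not measured
    INFERENCE_MS = 200    # from the observations above
    TARGET_MS = 300       # from the stated goal

    print(request_latency_ms(ASSUMED_RTT_MS, INFERENCE_MS, reuse_connection=False))
    print(request_latency_ms(ASSUMED_RTT_MS, INFERENCE_MS, reuse_connection=True))
    print(max_rtt_for_target_ms(TARGET_MS, INFERENCE_MS))
```

Under that assumed RTT, a fresh plain-HTTP connection per request would account for the observed ~700 ms, and connection reuse would bring it to roughly 450 ms. With 200 ms of inference, a 300 ms target leaves about 100 ms for the network even with a single round trip, which suggests that the calling service would need to sit much closer to the GPU server, or the model closer to the users.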
Any insights or real-world experiences would be very helpful.
Thanks!