Vector Store API calls returning 504s, 503s, and generally being slow
This is completely disrupting my chat
Given that this workflow operated successfully for months and now fails without major logic changes, this looks more like a service-side scalability/regression issue than an application-layer bug.
A few things stand out:
the failures occur specifically during paginated listing
both 503 (overload) and 504 (timeout) responses are appearing
retry-after: 120 suggests the backend is explicitly signaling load pressure
the issue appears intermittent rather than deterministic
So one possible explanation is that vector store file listing performance has regressed for larger stores, causing pagination requests to time out under load.
A few mitigation ideas worth testing in the meantime:
reduce pagination size (limit: 25 or 50)
add exponential backoff + jitter for 503/504 responses
persist incremental sync state instead of full-store cycling
avoid clearing/reloading entire vector stores weekly if possible
stagger cron execution windows if multiple stores sync simultaneously
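
For the first two ideas, here is a rough sketch of what I mean, not a drop-in fix. It assumes you wrap your own client call in a function `fetch_page(limit, after)` that returns `(items, next_cursor)` and raises on 503/504. `TransientAPIError`, `fetch_page`, and the other names below are hypothetical placeholders, not part of any SDK. The delay uses full-jitter exponential backoff and never waits less than the server's retry-after hint.

```python
import random
import time
from typing import Callable, Iterator, List, Optional, Tuple


class TransientAPIError(Exception):
    """Hypothetical stand-in for a 503/504 raised by your API client."""

    def __init__(self, status: int, retry_after: Optional[float] = None):
        super().__init__(f"HTTP {status}")
        self.status = status
        self.retry_after = retry_after  # seconds, from the retry-after header if present


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 120.0,
                  retry_after: Optional[float] = None) -> float:
    """Full-jitter exponential backoff, never shorter than the server's hint."""
    jittered = random.uniform(0, min(cap, base * 2 ** attempt))
    return max(jittered, retry_after or 0.0)


def call_with_retry(fn: Callable[[], object], max_attempts: int = 6,
                    sleep: Callable[[float], None] = time.sleep):
    """Retry fn on 503/504 only; re-raise anything else or the final failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientAPIError as err:
            if err.status not in (503, 504) or attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, retry_after=err.retry_after))


# Assumed shape of your own wrapper: (limit, after_cursor) -> (items, next_cursor)
FetchPage = Callable[[int, Optional[str]], Tuple[List[str], Optional[str]]]


def list_all_files(fetch_page: FetchPage, limit: int = 25,
                   sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Walk every page with a small page size, retrying each page independently."""
    after: Optional[str] = None
    while True:
        items, after = call_with_retry(lambda: fetch_page(limit, after), sleep=sleep)
        yield from items
        if after is None:  # no cursor means this was the last page
            return
```

Retrying per page rather than restarting the whole listing means one slow page does not throw away the pages already fetched.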
It may also help to log:
vector store file counts
response latency per page
whether failures correlate with larger stores/page depths
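
Something like the following could capture those three signals per page. Again this is only a sketch: `timed_page`, `PageMetric`, and the logger name are made up for illustration, and `fetch` is whatever zero-argument callable performs your actual page request. It returns the error instead of raising so the caller can decide whether to retry.

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

logger = logging.getLogger("vector_store_sync")  # hypothetical logger name


@dataclass
class PageMetric:
    store_id: str
    page_index: int          # page depth within this listing run
    items: int               # files returned on this page
    latency_s: float
    error: Optional[str]     # repr of the exception, or None on success


def timed_page(store_id: str, page_index: int, fetch: Callable[[], List[str]],
               clock: Callable[[], float] = time.monotonic
               ) -> Tuple[Optional[List[str]], PageMetric]:
    """Run one page request, recording latency and outcome either way."""
    start = clock()
    try:
        items: Optional[List[str]] = fetch()
        error = None
    except Exception as exc:  # record the failure; the caller decides on retries
        items, error = None, repr(exc)
    metric = PageMetric(store_id, page_index, len(items or []), clock() - start, error)
    logger.info("store=%s page=%d items=%d latency=%.3fs error=%s",
                metric.store_id, metric.page_index, metric.items,
                metric.latency_s, metric.error)
    return items, metric
```

Summing `items` across pages gives the file count per store, and grouping failures by `page_index` shows whether deeper pages fail more often.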
And the fact this worked reliably for months is probably the most important signal here