Inference Endpoint Hanging in "Initializing"
Hello! I’ve been running into a problem with the inference endpoints and I was wondering if anyone has any suggestions.
Problem Description
I’ve been had three models deployed on Inference Endpoints for about a year now. They are all using A100 GPUs hosted on AWS and scale to zero when not in use. In the past week I’ve had the endpoints not get past the initialization step without manual intervention. This problem is intermittent.
I know that the scaling up means that the endpoints need to wait until AWS hardware is available, but the issue is that the endpoints sometimes never seem to get past initialization on their own. I need to manually go in, force scale to zero, then start them back up. After the manual intervention they always get past initialization which makes me think this issue is not just waiting for hardware to be available.
At first we saw this just once in a 6 month period, then suddenly it happens multiple times a day.
Additional Information
Logs
No logs are generated, the endpoint hangs without seemingly even entering the inference engine container.
System Configuration
Models: LLama 3.1 8B custom varient
GPU: A100
Vendor: AWS
Inference Engine Container: hommayushi3/vllm-huggingface
Scaling: Scale to zero when not in use
Discussion in the ATmosphere