[Usage]: GetTimeoutError when run distributed inference on ray with tensor parallel size > 1 #9694
How would you like to use vllm
I am trying the https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_distributed.py example with tensor_parallel_size > 1; a condensed sketch of what I am running is shown below.
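This is adapted from memory of that example rather than copied verbatim: the model, prompts, batch size, and concurrency are placeholder values I chose for a quick test, and the exact `map_batches` / placement-group arguments may differ across Ray versions.

```python
from typing import Dict

import numpy as np
import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
from vllm import LLM, SamplingParams

# Number of GPUs to shard the model across; > 1 makes vLLM use its
# Ray-based distributed backend.
tensor_parallel_size = 2

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)


class LLMPredictor:
    def __init__(self):
        self.llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                       tensor_parallel_size=tensor_parallel_size)

    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, list]:
        outputs = self.llm.generate(list(batch["text"]), sampling_params)
        return {
            "prompt": [o.prompt for o in outputs],
            "generated_text": [o.outputs[0].text for o in outputs],
        }


def scheduling_strategy_fn():
    # One GPU bundle per tensor-parallel shard, packed on a single node, so
    # the predictor actor and the vLLM workers it spawns share one
    # placement group.
    pg = ray.util.placement_group(
        [{"GPU": 1, "CPU": 1}] * tensor_parallel_size,
        strategy="STRICT_PACK")
    return dict(scheduling_strategy=PlacementGroupSchedulingStrategy(
        pg, placement_group_capture_child_tasks=True))


ds = ray.data.from_items([{"text": "Hello, my name is"}] * 8)
ds = ds.map_batches(
    LLMPredictor,
    concurrency=1,
    batch_size=4,
    num_gpus=0,  # GPUs come from the placement group, not the actor itself
    ray_remote_args_fn=scheduling_strategy_fn,
)
print(ds.take_all())
```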
When I run this, I hit a GetTimeoutError.
The LLM engine itself seems to load correctly. However, GPU memory usage is unexpectedly high given that TinyLlama/TinyLlama-1.1B-Chat-v1.0 is a small model. (I chose this model to keep model-loading time short for quick tests.) I want to know what is causing this error and how to fix it; the problem has been blocking me for days.
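If it helps narrow things down, the same Ray-backed tensor-parallel path can be exercised without Ray Data. This is a minimal sketch, assuming two visible GPUs (the prompt and sampling settings are arbitrary):

```python
from vllm import LLM, SamplingParams

# Minimal standalone configuration: a small model sharded across 2 GPUs.
# tensor_parallel_size > 1 is what triggers the distributed (Ray) backend.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    tensor_parallel_size=2,
)

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```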