Your current environment

Running via Docker.

Model Input Dumps

No response

🐛 Describe the bug

Note: The issue is not seen in release v0.6.2.

From release 0.6.3, any input larger than 32K tokens produces garbage output from the model.

Model deployed: nvidia/Llama-3.1-Nemotron-70B-Instruct-HF with tensor-parallel-size 4 on 4 A100 GPUs.

When I rolled back to the 0.6.2 release the issue disappeared, and the model remains stable up to 130K input tokens without any issue.
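For context, here is a minimal sketch of the kind of request that triggers the behaviour, assuming the model is served through vLLM's OpenAI-compatible endpoint on the default port 8000 under its Hugging Face name; the filler prompt, port, and rough token counts are illustrative, not taken from the actual workload.

```python
# Minimal long-context reproduction sketch.
# Assumptions: the vLLM OpenAI-compatible server runs at localhost:8000 and
# serves the model under its Hugging Face name; the filler text is a stand-in
# for the real long documents used in the deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Build a prompt comfortably above the ~32K-token mark where the garbling starts
# (roughly 60K tokens with the Llama tokenizer).
filler = "The quick brown fox jumps over the lazy dog. " * 6000

response = client.chat.completions.create(
    model="nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    messages=[{"role": "user", "content": filler + "\n\nSummarize the text above in one sentence."}],
    max_tokens=256,
    temperature=0,
)

print(response.choices[0].message.content)  # garbled on 0.6.3, coherent on 0.6.2
```

Pointing the same script at a v0.6.2 and a v0.6.3 deployment of the same model makes the regression easy to compare side by side.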
For vLLM 0.6.3, not only does inference with text exceeding 32K tokens result in garbled output, but garbled output can also occur with inputs of around 1K tokens in high-concurrency scenarios. The issue persists even when using enforce-eager. However, when the problematic prompt is tested separately with the same parameters, no garbled output is produced.
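A sketch of the kind of concurrency test described above, again assuming the same local OpenAI-compatible endpoint; the prompt length, request count, and pool size are illustrative only.

```python
# Concurrency reproduction sketch (assumption: same local vLLM endpoint as above).
# Sends many ~1K-token prompts in parallel and checks whether any completion
# diverges, as a crude proxy for the garbled output seen under load.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
prompt = ("Paris is the capital of France. " * 150) + "\nRepeat the first sentence exactly."

def ask(_: int) -> str:
    resp = client.chat.completions.create(
        model="nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
        temperature=0,
    )
    return resp.choices[0].message.content

with ThreadPoolExecutor(max_workers=64) as pool:
    outputs = list(pool.map(ask, range(256)))

# With temperature 0 every answer should be identical, so any outlier is a
# candidate for the corrupted output reported under high concurrency.
print(f"{len(set(outputs))} distinct responses out of {len(outputs)}")
```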
If you run into this issue in eager mode as well, then it might not be due to that reason (CUDA graphs). BTW, perhaps you can refer to #9581 (comment) to catch the root cause.
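For completeness, a small offline sketch of how graph capture can be disabled to run that eager-mode check; aside from the model name and tensor-parallel size, the prompt and sampling settings are placeholders.

```python
# Eager-mode check sketch: run the same model with CUDA graph capture disabled
# (enforce_eager=True) to see whether the corruption still occurs. The prompt
# and sampling settings below are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    tensor_parallel_size=4,
    enforce_eager=True,  # skip CUDA graph capture entirely
)

outputs = llm.generate(
    ["Summarize the history of GPUs in two sentences."],
    SamplingParams(temperature=0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

If the output is still garbled here, CUDA graphs can be ruled out as the cause, which matches the observation in the earlier comment.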