
[Bug]: Input length greater than 32K in nvidia/Llama-3.1-Nemotron-70B-Instruct-HF generates garbage on v0.6.3 (issue is not seen in v0.6.2) #9670

Open
source-ram opened this issue Oct 24, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@source-ram

Your current environment

Running via Docker
```shell
docker run --runtime nvidia --gpus "device=${CUDA_VISIBLE_DEVICES}" \
    --shm-size 8g \
    -v $volume:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=***" \
    -p 5005:5005 \
    --ipc=host \
    vllm/vllm-openai:v0.6.3.post1 \
    --model nvidia/Llama-3.1-Nemotron-70B-Instruct-HF \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --port 5005
```

Model Input Dumps

No response

🐛 Describe the bug

Note: The issue is not seen in release v0.6.2.

Starting with release v0.6.3, any input larger than 32K tokens produces garbage output from the model.
The model deployed is nvidia/Llama-3.1-Nemotron-70B-Instruct-HF with tensor-parallel-size 4 on 4 A100 GPUs.
When I rolled back to the v0.6.2 release the issue disappeared, and the model is stable up to 130K input tokens without any problem.
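A minimal sketch of the kind of request that triggers this, assuming the OpenAI-compatible server started with the docker command above (the real prompt is a document well over 32K tokens; the placeholder below only marks where it goes):

```shell
# Long-context request against the server from the environment section.
# The <...> placeholder stands in for a real document of more than 32K tokens.
curl http://localhost:5005/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
        "messages": [
          {"role": "user", "content": "<paste a >32K-token document here> Summarize the document above."}
        ],
        "max_tokens": 512,
        "temperature": 0
      }'
```

On v0.6.3 the completion for inputs like this comes back as garbage; on v0.6.2 the same request behaves normally.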

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
source-ram added the bug (Something isn't working) label Oct 24, 2024
@jeejeelee
Contributor

You can try using --enforce-eager to verify whether the issue is caused by CUDA graphs.
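(For reference, a sketch of the same docker command from the issue description with the flag appended; --enforce-eager disables CUDA graph capture and runs the model in eager mode.)

```shell
# Same deployment as in the issue description, with CUDA graphs disabled.
# --enforce-eager is an argument to vLLM (after the image name), not to docker.
docker run --runtime nvidia --gpus "device=${CUDA_VISIBLE_DEVICES}" \
    --shm-size 8g \
    -v $volume:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=***" \
    -p 5005:5005 \
    --ipc=host \
    vllm/vllm-openai:v0.6.3.post1 \
    --model nvidia/Llama-3.1-Nemotron-70B-Instruct-HF \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --port 5005 \
    --enforce-eager
```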

@HuggingAha

HuggingAha commented Oct 25, 2024

For vLLM 0.6.3, not only does inference with text exceeding 32K tokens result in garbled output, but garbled output can also occur with inputs of around 1K tokens in high-concurrency scenarios. This issue persists even when using --enforce-eager. However, when testing the problematic prompt separately with the same parameters, no garbled output is produced.
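Roughly the kind of load where this shows up, as a sketch (placeholder prompt of about 1K tokens; endpoint, port, and model assumed from the issue description; the real workload sends many different prompts in parallel):

```shell
# Fire 32 concurrent ~1K-token requests at the same endpoint and save
# each response; under this load some completions come back garbled.
for i in $(seq 1 32); do
  curl -s http://localhost:5005/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
          "messages": [{"role": "user", "content": "<~1K-token prompt>"}],
          "max_tokens": 256,
          "temperature": 0
        }' > "out_${i}.json" &
done
wait
```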

@jeejeelee
Contributor

jeejeelee commented Oct 25, 2024

If you run into this issue in eager mode as well, then it might not be due to that reason (CUDA graphs). BTW, perhaps you can refer to #9581 (comment) to track down the root cause.

@sir3mat

sir3mat commented Oct 25, 2024

Same behaviour with Llama 3.1 70B (128K context) and Llama 3.2 3B (128K context).
