
[Bug]: Input length greater than 32K in nvidia/Llama-3.1-Nemotron-70B-Instruct-HF generates garbage on v0.6.3 (issue is not seen in v0.6.2) #9670

Open
source-ram opened this issue Oct 24, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@source-ram

Your current environment

Running via Docker
```shell
docker run --runtime nvidia --gpus "device=${CUDA_VISIBLE_DEVICES}" \
    --shm-size 8g \
    -v $volume:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=***" \
    -p 5005:5005 \
    --ipc=host \
    vllm/vllm-openai:v0.6.3.post1 \
    --model nvidia/Llama-3.1-Nemotron-70B-Instruct-HF \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --port 5005
```

Model Input Dumps

No response

🐛 Describe the bug

Note: The issue is not seen in release v0.6.2.

Starting with release v0.6.3, any input larger than 32K tokens produces garbage output from the model.
The model deployed is nvidia/Llama-3.1-Nemotron-70B-Instruct-HF with tensor-parallel-size 4 on 4 A100 GPUs.
When I rolled back to the v0.6.2 release the issue disappeared, and the model is stable up to 130K input tokens without any problem.
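A minimal sketch of the kind of request that triggers this, assuming the OpenAI-compatible server started with the docker command above (the real prompt is a document well over 32K tokens; the placeholder below only marks where it goes):

```shell
# Long-context request against the server from the environment section.
# The <...> placeholder stands in for a real document of more than 32K tokens.
curl http://localhost:5005/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
        "messages": [
          {"role": "user", "content": "<paste a >32K-token document here> Summarize the document above."}
        ],
        "max_tokens": 512,
        "temperature": 0
      }'
```

On v0.6.3 the completion for inputs like this comes back as garbage; on v0.6.2 the same request behaves normally.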

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
source-ram added the bug (Something isn't working) label Oct 24, 2024
@jeejeelee
Contributor

You can try using --enforce-eager to verify whether the issue is caused by CUDA graphs.
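(For reference, a sketch of the same docker command from the issue description with the flag appended; --enforce-eager disables CUDA graph capture and runs the model in eager mode.)

```shell
# Same deployment as in the issue description, with CUDA graphs disabled.
# --enforce-eager is an argument to vLLM (after the image name), not to docker.
docker run --runtime nvidia --gpus "device=${CUDA_VISIBLE_DEVICES}" \
    --shm-size 8g \
    -v $volume:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=***" \
    -p 5005:5005 \
    --ipc=host \
    vllm/vllm-openai:v0.6.3.post1 \
    --model nvidia/Llama-3.1-Nemotron-70B-Instruct-HF \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --port 5005 \
    --enforce-eager
```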

@HuggingAha

HuggingAha commented Oct 25, 2024

For vLLM 0.6.3, not only does inference with text exceeding 32K tokens result in garbled output, but garbled output can also occur with inputs of around 1K tokens in high-concurrency scenarios. This issue persists even when using --enforce-eager. However, when testing the problematic prompt separately with the same parameters, no garbled output is produced.
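Roughly the kind of load where this shows up, as a sketch (placeholder prompt of about 1K tokens; endpoint, port, and model assumed from the issue description; the real workload sends many different prompts in parallel):

```shell
# Fire 32 concurrent ~1K-token requests at the same endpoint and save
# each response; under this load some completions come back garbled.
for i in $(seq 1 32); do
  curl -s http://localhost:5005/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
          "messages": [{"role": "user", "content": "<~1K-token prompt>"}],
          "max_tokens": 256,
          "temperature": 0
        }' > "out_${i}.json" &
done
wait
```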

@jeejeelee
Contributor

jeejeelee commented Oct 25, 2024

If you run into this issue in eager mode as well, then it might not be due to that reason (CUDA graphs). BTW, perhaps you can refer to #9581 (comment) to track down the root cause.

@sir3mat

sir3mat commented Oct 25, 2024

Same behaviour with Llama 3.1 70B (128K context) and Llama 3.2 3B (128K context).
