Slow text generation on dual Arc A770's w/ vLLM #12190
Just wanted to update and say that I removed the pipeline-parallel-size line, as it was throwing errors about device_ids, but the text generation speed still hasn't gone above 5t/s.
Hi, I am trying to reproduce this issue in my environment. I will update this thread.
Hi, could you tell me how you tested the performance mentioned in the thread (around 14-15t/s and 8-9t/s, respectively)? Also, could you share the results of this script? https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/scripts/env-check.sh
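If helpful, a quick way to run that check and capture its output for sharing (a sketch, assuming the script is available locally from a clone of the repo under python/llm/scripts):

```bash
# Run ipex-llm's environment check and save the output so it can be attached to the thread.
bash env-check.sh 2>&1 | tee env-check.log
```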
Hi there.
Apologies, the dual GPU scores were determined the same way as the single GPU scores. Single GPU: inference 14t/s, text generation 8t/s.
Hi, can you check if you have installed the out-of-tree driver on the host? You can check it through the following command:
apt list | grep i915
# WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
# Example output if this has been installed...
# intel-i915-dkms/unknown 1.23.10.72.231129.76+i112-1 all [upgradable from: 1.23.10.54.231129.55+i87-1]
If you have not installed the out-of-tree driver, you can install it with our guide: https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md#install-gpu-driver
Besides, could you provide me with an executable bash script? I want to ensure that the commands I execute are exactly the same as yours, including the server startup and testing scripts, as well as the scripts for running inference and text generation.
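For reference, a quick way to double-check on the host which i915 build is actually in use (a sketch using standard tools; exact version strings will differ per system):

```bash
# Is the i915 module loaded at all?
lsmod | grep -w i915
# Out-of-tree (DKMS) builds typically report a backport version string here; the in-tree driver usually does not.
modinfo -F version i915
# Lists intel-i915-dkms if the DKMS package is installed and built for the running kernel.
dkms status | grep -i i915
```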
The i915 driver is loaded on my unRAID host system. The exact script I use to run vLLM is below:
export CCL_WORKER_COUNT=2
export USE_XETLA=OFF
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server
Ignore the high max model length, batched tokens and seqs; I've found that they do very little to improve perf.
Could you please share the client script/code you use to send requests to the vLLM API server, so we can measure the same inference/text generation TPS as you.
I use OpenWebUI to connect to the vLLM OpenAI-compatible API server, so it follows the standard scheme from there. Ignore the goofy 'in hoodlum speak'; that's for generating conversation topic names in OpenWebUI, which I found funny.
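For what it's worth, OpenWebUI can be taken out of the picture by timing a single request against the OpenAI-compatible endpoint directly. A minimal sketch, assuming the server from this thread on port 8000 with served model name "Llama-3.1"; the prompt and max_tokens are placeholders:

```bash
# Time one chat completion; usage.completion_tokens in the JSON response divided by
# the elapsed time gives an approximate text generation tokens/s.
start=$(date +%s.%N)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Llama-3.1",
        "messages": [{"role": "user", "content": "Write a short story about a lighthouse."}],
        "max_tokens": 256
      }'
end=$(date +%s.%N)
awk -v s="$start" -v e="$end" 'BEGIN { printf "elapsed: %.2f s\n", e - s }'
```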
Hi, I have tested the performance of the vLLM serving engine using your prompt. Here is what we got:
The command for starting the test is listed below:
# For starting the server:
#!/bin/bash
model="/llm/models/Meta-Llama-3.1-8B-Instruct/"
served_model_name="Llama-3.1"
export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 8192 \
--max-num-batched-tokens 10000 \
--max-num-seqs 256 \
--block-size 8 \
--tensor-parallel-size 2 # Change this to 1 for single-card serving
For the test, you can use the vllm_online_benchmark.py script in the container, and change this line https://github.com/intel-analytics/ipex-llm/blob/main/docker/llm/serving/xpu/docker/vllm_online_benchmark.py#L459 to your prompt.
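For reference, a sketch of how the benchmark is typically invoked inside the container, following the invocation quoted later in this thread (served model name, max concurrent requests, input length, output length); I'm assuming the first argument has to match --served-model-name:

```bash
# Run the online benchmark against the already-running server (port 8000 by default).
# Arguments: <served_model_name> <max_seq> <input_len> <output_len>
python vllm_online_benchmark.py Llama-3.1 8 1024 512
```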
Besides, to test the performance of the vLLM engine, we recommend using the vllm_online_benchmark.py script mentioned above.
Is that testing the inference performance or the text generation performance? I can't seem to get vllm_online_benchmark working; it just keeps throwing 404 errors on the server as far as I can tell. My issue is with text generation speed. The script you provided did gain an extra token per second, but from what I've seen, adding an extra GPU has just slowed things down overall.
This test measures the text generation performance. As far as I can tell, OpenWebUI just sends requests to the vLLM OpenAI API server, which is exactly what the benchmark script does. The 404 error might be caused by:
Adding a GPU does benefit the computation, but it also brings additional communication overhead. For 7B/8B/9B LLM models, the computation saved on each next token is not enough to outweigh the communication overhead, so next-token latency will be slightly slower on dual cards than on a single card, although the first token will be a lot faster. The overall slowdown you mention is what you observe with batch=1, short inputs, and next-token latency accounting for a large portion of overall performance. Once you increase the batch size, i.e. serve multiple requests in parallel, or use longer inputs, you will find that overall throughput is better on dual cards than on a single card. BTW, for 7B/8B/9B LLM models, if you don't need to serve super long inputs or need faster first-token latency, we recommend deploying one instance per card for the LLM service (see the sketch below); by deploying multiple instances, we retain the next-token latency advantage of a single card and get 2x the throughput. For a ~14B model we recommend using two Arc A770s, and for a ~33B model we recommend using four Arc A770s.
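A minimal sketch of the one-instance-per-card idea, assuming each process can be pinned to a card with the Level Zero ZE_AFFINITY_MASK variable; the ports, model path, and load balancing note are illustrative, not the exact recipe from the docs:

```bash
#!/bin/bash
# Sketch: two independent single-card vLLM instances, one per Arc A770.
source /opt/intel/1ccl-wks/setvars.sh

for i in 0 1; do
  # Pin instance $i to GPU $i and give it its own port (8000, 8001).
  ZE_AFFINITY_MASK=$i python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
    --served-model-name Llama-3.1 \
    --model /llm/models/Meta-Llama-3.1-8B-Instruct \
    --port $((8000 + i)) \
    --device xpu \
    --dtype float16 \
    --enforce-eager \
    --load-in-low-bit sym_int4 \
    --tensor-parallel-size 1 &
done
wait
# A simple reverse proxy (e.g. nginx) can then round-robin client requests across ports 8000 and 8001.
```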
Thank you very much for this advice. I did notice something very strange, though. I decided to spin up a Llama.cpp instance using the ipex-llm guides and noticed SIGNIFICANTLY faster generation, which was in fact utilising both cards to generate. Inference speed was not as fast as with vLLM; however, text generation was the fastest I had seen overall. Why might this be happening? I can achieve incredible results with llama.cpp, but vLLM, which should be faster, is performing at 4-5 tokens per second for text generation with the same model loaded.
Hi @HumerousGorgon, what is the difference between the inference and text generation you mention here? Do you observe data like the below, or are there other metrics that help you differentiate between inference and text generation?
I'm going to run those benchmarks now to see whether my performance is in line with the numbers here. My problem is not with inference; it's with the generation of text. With vLLM it was extremely slow, a couple of words every second (4-5t/s), but with Llama.cpp and the exact same model, I was seeing it generate whole sentences in a second.
We used the same method, with OpenWebUI as the front end, to deploy the Llama-3.1-8B model. Text generation in the web interface is very fast, and responses start immediately after sending a question. The measured values show about 40t/s on one card, while on two cards it stably reaches more than 45t/s. For the 128-token prompt, the test results are as follows:
Okay, something very wrong is happening with my vLLM instance, because I am seeing NOWHERE near those numbers. You are getting roughly 10x the performance I am. Could you share the script you use to start it?
We have updated our OpenWebUI + vLLM serving workflow here. The Docker start script is here; please update the Docker image before starting it. The frontend and backend startup scripts are as follows; note that you should change the placeholder values (such as <api-key> and <vllm-host-ip>) to your own.
vLLM serving start script:
#!/bin/bash
model="/llm/models/Meta-Llama-3.1-8B-Instruct"
served_model_name="Meta-Llama-3.1-8B-Instruct"
export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit fp8 \
--max-model-len 2048 \
--max-num-batched-tokens 4000 \
--max-num-seqs 12 \
--api-key <api-key> \
--tensor-parallel-size 2
open-webui start script:
#!/bin/bash
export DOCKER_IMAGE=ghcr.io/open-webui/open-webui:main
export CONTAINER_NAME=open-webui
docker rm -f $CONTAINER_NAME
docker run -itd \
-p 3000:8080 \
-e OPENAI_API_KEY=<api-key> \
-e OPENAI_API_BASE_URL=http://<vllm-host-ip>:8000/v1 \
-v open-webui:/app/backend/data \
--name $CONTAINER_NAME \
--restart always $DOCKER_IMAGE
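Before pointing OpenWebUI at the server, a quick sanity check that the API is reachable with the key can rule out 404/connection issues (a sketch; substitute your real host and key):

```bash
# List the models served by the vLLM OpenAI-compatible API;
# a JSON model list confirms the host, port, and API key are all correct.
curl -s http://<vllm-host-ip>:8000/v1/models \
  -H "Authorization: Bearer <api-key>"
```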
Hey there. I used this script exactly as you had put it, updated my container to the latest version...
As you can see, even with the newest branch, this is just flat out not working the way everyone else's is. I'm using unRAID as my host machine. There is one difference: in order to get my GPUs working, I have to pass two environment variables to the container. This is the only fundamental difference here. I'm starting to lose my mind.
We recommend setting up your environment with Ubuntu 22.04 and kernel 6.5, and installing the intel-i915-dkms driver for optimal performance. Once you have the Ubuntu OS set up, please follow the steps below to install kernel 6.5:
export VERSION="6.5.0-35"
sudo apt-get install -y linux-headers-$VERSION-generic
sudo apt-get install -y linux-image-$VERSION-generic
sudo apt-get install -y linux-modules-$VERSION-generic # may not be needed
sudo apt-get install -y linux-modules-extra-$VERSION-generic
After installing, you can configure GRUB to use the new kernel by running the following:
sudo sed -i "s/GRUB_DEFAULT=.*/GRUB_DEFAULT=\"1> $(echo $(($(awk -F\' '/menuentry / {print $2}' /boot/grub/grub.cfg \
| grep -no $VERSION | sed 's/:/\n/g' | head -n 1)-2)))\"/" /etc/default/grub
Then, update GRUB and reboot your system:
sudo update-grub
sudo reboot
For detailed instructions on installing the driver, you can follow the install guide linked earlier in this thread (install_linux_gpu.md). Please feel free to reach out if you have any questions!
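After rebooting, a quick check that the new kernel and the DKMS driver are actually in use might look like this (a sketch; exact version strings will differ):

```bash
# Confirm the running kernel is the 6.5 build that was just installed.
uname -r
# Confirm the intel-i915-dkms module built and installed against that kernel.
dkms status | grep -i i915
# Show which kernel driver each graphics device is bound to.
lspci -k | grep -EA3 'VGA|Display'
```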
Please also make sure "re-sizeable BAR support" and "above 4G MMIO" are enabled in the BIOS.
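One way to check this from Linux, assuming a reasonably recent pciutils, is to look at the BAR sizes the card was actually given (a sketch; the PCI address is illustrative and should be replaced with the one reported for your card):

```bash
# Find the PCI addresses of the Arc cards, then inspect their BARs.
lspci | grep -i 'VGA\|Display'
# With Resizable BAR active, the prefetchable memory region of an A770 is typically
# reported as a multi-gigabyte size (e.g. 16G) rather than 256M.
sudo lspci -vv -s 03:00.0 | grep -i -A3 'Resizable BAR\|prefetchable'
```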
It might be related to the CPU/GPU frequency. You can try adjusting the CPU/GPU frequency to see if it has any impact. For CPU frequency, you can use cpupower; example output of cpupower frequency-info:
analyzing CPU 30:
driver: intel_pstate
CPUs which run at the same hardware frequency: 30
CPUs which need to have their frequency coordinated by software: 30
maximum transition latency: Cannot determine or is not supported.
hardware limits: 800 MHz - 4.50 GHz
available cpufreq governors: performance powersave
current policy: frequency should be within 3.80 GHz and 4.50 GHz.
The governor "powersave" may decide which speed to use
within this range.
current CPU frequency: Unable to call hardware
current CPU frequency: 3.80 GHz (asserted by call to kernel)
boost state support:
Supported: yes
Active: yes
We set the CPU frequency to 3.8GHz for optimal performance:
sudo cpupower frequency-set -d 3.8GHz
For GPU, you can set the frequency using the following commands:
sudo xpu-smi config -d 0 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 1 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 2 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 3 -t 0 --frequencyrange 2400,2400
Let us know if adjusting the frequencies helps improve the performance!
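If the clocks keep dropping back down, it may also be worth pinning the CPU governor and re-checking afterwards (a sketch using standard cpupower options):

```bash
# Switch to the performance governor so the configured frequency floor is respected.
sudo cpupower frequency-set -g performance
# Re-check the active policy, governor, and current frequency after the change.
cpupower frequency-info | grep -E 'governor|current policy|current CPU frequency'
```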
So I went ahead and purchased a WHOLE ENTIRE NEW SYSTEM
Also, removing tensor parallelism results in 22 tokens per second, so I'm still getting half of the performance that others in this thread are experiencing.
And finally, setting the CPU frequency to 3.8GHz did, for a single second, bring the output speed to 40 tokens per second, but it then dropped back down to 20 tokens per second.
Finally finally, I changed the governor to performance; still no change.
After starting vLLM serving according to the above method, you can directly use:
python vllm_online_benchmark.py Meta-Llama-3.1-8B-Instruct 8 1024 512
What's more, you need to make sure the port and model name used by the benchmark match the ones in your serving script, i.e. the part that looks like:
...
model="/llm/models/llama-3-1-instruct"
served_model_name="llama-3-1-instruct"
...
Have you set the GPU frequency using the xpu-smi commands above?
Is xpu-smi supported for Arc series GPUs?
Yes, you can follow this guide https://github.com/intel/xpumanager/blob/master/doc/smi_install_guide.md to install xpu-smi.
Hello everyone! Okay... after reinstalling with Ubuntu 22.04.4, installing kernel 6.5, installing the DKMS drivers, and manually changing the CPU and GPU frequencies, I'm finally getting the correct performance. Using a single GPU I get 40-42 tokens per second, and with dual GPUs I get 38-41 tokens per second. The drop in performance with the second GPU is likely just a link speed difference between them, which is nothing. Thank you everyone for your help during this. I appreciate your patience and extended support!
Closing this issue. Feel free to tell us if you have further problems!
Hello!
Followed the quickstart guide regarding vLLM serving through the available Docker image.
I'm using 2 x Arc A770s in my system.
When configured and running on a single GPU, inference speed is fantastic and text generation speed is good (around 14-15t/s and 8-9t/s, respectively). When setting tensor_parallel_size and pipeline_parallel_size to 2 to scale to both GPUs, inference speed doubles; however, text generation speed halves, down to 3-4t/s.
Below is my start-vllm-service.sh config:
#!/bin/bash
model="/llm/models/llama-3-1-instruct"
served_model_name="Llama-3.1"
export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.9 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 8192 \
--max-num-batched-tokens 10000 \
--max-num-seqs 256 \
--block-size 8 \
--tensor-parallel-size 2 \
--pipeline-parallel-size 2
Maybe I'm missing something, maybe I'm not. I did read that setting SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS to 1 gives a performance boost, but I set it back to 2 during troubleshooting.
Thanks for taking the time to read!
Hoping someone has an answer.