
Slow text generation on dual Arc A770's w/ vLLM #12190

Closed
HumerousGorgon opened this issue Oct 13, 2024 · 36 comments

@HumerousGorgon

Hello!

Followed the quickstart guide regarding vLLM serving through the available Docker image.
I'm using 2 x Arc A770's in my system.
When configured and running on a single GPU, inference speed is fantastic and text generation speed is good (around 14-15t/s and 8-9t/s, respectively). When setting tensor_parallel_size and pipeline_parallel_size to 2 to scale to both GPUs, inference speed doubles, however text generation speed halves, down to 3-4t/s.

Below is my start-vllm-service.sh config:
#!/bin/bash
model="/llm/models/llama-3-1-instruct"
served_model_name="Llama-3.1"

export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1

export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0

source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit sym_int4 \
  --max-model-len 8192 \
  --max-num-batched-tokens 10000 \
  --max-num-seqs 256 \
  --block-size 8 \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2

Maybe I'm missing something, maybe I'm not. I did read that setting SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS to 1 gives a performance boost, but I set it back to 2 during troubleshooting.

Thanks for taking the time to read!
Hoping someone has an answer.

@HumerousGorgon
Author

Just wanted to update and say that I removed the pipeline-parallel-size line as it was throwing errors about device_id's, but that the text generation speed still hasn't gone above 5t/s.

@gc-fu
Contributor

gc-fu commented Oct 14, 2024

Hi, I am trying to reproduce this issue in my environment. I will update this thread.

@gc-fu
Contributor

gc-fu commented Oct 14, 2024

Hi, could you tell me how you tested the performance mentioned in the thread (around 14-15t/s and 8-9t/s, respectively)?

Also, could you share the results of this script?

https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/scripts/env-check.sh

@HumerousGorgon
Author

HumerousGorgon commented Oct 16, 2024

Hi there.
Here is the output of the script:

PYTHON_VERSION=3.11.10

transformers=4.44.2

torch=2.1.0.post2+cxx11.abi

ipex-llm
DEPRECATION: Loading egg at /usr/local/lib/python3.11/dist-packages/oneccl_bind_pt-2.1.300+xpu-py3.11-linux-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at pypa/pip#12330
DEPRECATION: Loading egg at /usr/local/lib/python3.11/dist-packages/vllm-0.5.4+xpu-py3.11-linux-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at pypa/pip#12330
Version: 2.2.0b20241011

ipex=2.1.30.post0

CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 36
On-line CPU(s) list: 0-35
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
CPU family: 6
Model: 79
Thread(s) per core: 2
Core(s) per socket: 18
Socket(s): 1
Stepping: 1
CPU max MHz: 3300.0000
CPU min MHz: 1200.0000
BogoMIPS: 4197.80

Total CPU Memory: 125.722 GB

Operating System:
Ubuntu 22.04.4 LTS \n \l


Linux neutronserver 6.8.12-Unraid #3 SMP PREEMPT_DYNAMIC Tue Jun 18 07:52:57 PDT 2024 x86_64 x86_64 x86_64 GNU/Linux

CLI:
Version: 1.2.13.20230704
Build ID: 00000000

Service:
Version: 1.2.13.20230704
Build ID: 00000000
Level Zero Version: 1.14.0

Driver Version 2024.17.5.0.08_160000.xmain-hotfix
Driver Version 2024.17.5.0.08_160000.xmain-hotfix
Driver UUID 32332e33-352e-3237-3139-312e39000000
Driver Version 23.35.27191.9
Driver UUID 32332e33-352e-3237-3139-312e39000000
Driver Version 23.35.27191.9

Driver related package version:
ii intel-level-zero-gpu 1.3.27191.9 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii level-zero-dev 1.14.0-744~22.04 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.

igpu not detected

xpu-smi is properly installed.

+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information |
+-----------+--------------------------------------------------------------------------------------+
| 0 | Device Name: Intel Corporation Device 56a0 (rev 08) |
| | Vendor Name: Intel(R) Corporation |
| | UUID: 00000000-0000-0003-0000-000856a08086 |
| | PCI BDF Address: 0000:03:00.0 |
| | DRM Device: /dev/dri/card1 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 1 | Device Name: Intel Corporation Device 56a0 (rev 08) |
| | Vendor Name: Intel(R) Corporation |
| | UUID: 00000000-0000-0009-0000-000856a08086 |
| | PCI BDF Address: 0000:09:00.0 |
| | DRM Device: /dev/dri/card0 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
lspci: Unable to load libkmod resources: error -2
GPU0 Memory size=16G
GPU1 Memory size=16G

lspci: Unable to load libkmod resources: error -2
03:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
Subsystem: Device 172f:4133
Flags: bus master, fast devsel, latency 0, IRQ 69, NUMA node 0, IOMMU group 59
Memory at fa000000 (64-bit, non-prefetchable) [size=16M]
Memory at 383800000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at fb000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+

09:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
Subsystem: Device 172f:4133
Flags: bus master, fast devsel, latency 0, IRQ 66, NUMA node 0, IOMMU group 52
Memory at f8000000 (64-bit, non-prefetchable) [size=16M]
Memory at 383000000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at f9000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+

Checked the token performance by loading up Llama-3.1 8B and running the prompt "Tell me about yourself" 3 times, to determine performance after warmup.
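
For reference, a rough way to time generation against the OpenAI-compatible endpoint without going through OpenWebUI (a minimal sketch; it assumes the server from the start script above is listening on port 8000 and that total request time approximates generation time):

#!/bin/bash
# Rough tokens/s estimate: request a fixed number of completion tokens and time the whole call.
# Caveats: this includes first-token latency, and if the model stops early fewer tokens are generated.
start=$(date +%s.%N)
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Llama-3.1", "prompt": "Tell me about yourself", "max_tokens": 256, "temperature": 0}' > /dev/null
end=$(date +%s.%N)
echo "approx tokens/s: $(echo "256 / ($end - $start)" | bc -l)"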

@HumerousGorgon
Author

Apologies, the dual GPU scores were determined the same way as the single GPU scores:
"Tell me about yourself" 3 seperate times as a prompt.

Single GPU: Inference 14t/s, Text Gen 8t/s
Dual GPU: Inference 30-50t/s (I've seen up to 50, which was crazy), Text Gen 4-5t/s.

@gc-fu
Contributor

gc-fu commented Oct 17, 2024

Hi, can you check if you have installed the out-of-tree driver on the host?

You can check it through the following command:

apt list | grep i915

# WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

# Example output if this has been installed...
# intel-i915-dkms/unknown 1.23.10.72.231129.76+i112-1 all [upgradable from: 1.23.10.54.231129.55+i87-1]

If you have not installed the out of tree driver, you can install it with our guide: https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md#install-gpu-driver
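
If the package shows up, it may also be worth confirming which i915 module the running kernel actually loaded (a small sketch; assumes dkms and kmod are available on the host):

# Is the DKMS i915 module built/installed for the running kernel?
dkms status | grep -i i915

# Which i915 module version would the kernel load?
modinfo -F version i915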

Besides, could you provide me with an executable bash script? I want to ensure that the commands I execute are exactly the same as yours, including the server startup and testing scripts, as well as the scripts for running inference and text generation.

@HumerousGorgon
Author

The i915 driver is loaded on my unRAID host system.
Other functions such as transcoding on GPU can be seen using the driver, so I know it is working.

The exact script I use to run vLLM is below:
#!/bin/bash
model="/llm/models/llama-3-1-instruct"
served_model_name="Llama-3.1"

export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0

export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export TORCH_LLM_ALLREDUCE=0

source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code \
  --block-size 8 \
  --gpu-memory-utilization 0.9 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --use-v2-block-manager \
  --load-in-low-bit fp8 \
  --max-model-len 30000 \
  --max-num-batched-tokens 40000 \
  --max-num-seqs 512 \
  --tensor-parallel-size 2

Ignore the high max model length, batched tokens and seqs; I've found that they do very little to improve perf.
I've also found that the load-in-low-bit setting doesn't do anything to the generated tokens either.

@glorysdj
Contributor

Could you please share the client script/code you use to send requests to the vLLM API server, so we can use it to measure the inference/text generation TPS the same way as you do?

@HumerousGorgon
Author

I use OpenWebUI to connect to the vLLM OpenAI-compatible API server, so it follows the standard scheme from there.
This is the log of what is sent to the server when a chat request is made from OpenWebUI.
INFO 10-17 09:59:17 logger.py:36] Received request chat-963e4e5e5afe443292c41933f907f9a7: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a world class A.I. model designed to generate high quality, reliable and true responses.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTell me about yourself<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=29941, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 1627, 10263, 220, 2366, 19, 271, 2675, 527, 264, 1917, 538, 362, 2506, 13, 1646, 6319, 311, 7068, 1579, 4367, 11, 15062, 323, 837, 14847, 13, 128009, 128006, 882, 128007, 271, 41551, 757, 922, 6261, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.

Ignore the goofy 'in hoodlum speak'; that's for generating conversation topic names in OpenWebUI, which I found funny.
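
For reference, the same kind of request can be reproduced without OpenWebUI against the OpenAI-compatible endpoint (a minimal sketch; assumes the server from the start script above on port 8000 and the served model name Llama-3.1):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Llama-3.1",
        "messages": [
          {"role": "system", "content": "You are a world class A.I. model designed to generate high quality, reliable and true responses."},
          {"role": "user", "content": "Tell me about yourself"}
        ],
        "temperature": 0.7
      }'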

@gc-fu
Contributor

gc-fu commented Oct 18, 2024

Hi, I have tested the performance of the vLLM serving engine using your prompt. What we got is:

# Single card
first token: 100.4761297517689
next token: 15.190188343637834

# Multi card
first token: 69.21667874485138
next token: 21.78218797010891

The command for starting the test is listed below:

# For starting server:
#!/bin/bash
model="/llm/models/Meta-Llama-3.1-8B-Instruct/"
served_model_name="Llama-3.1"

export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1

export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0

source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--device xpu \
--dtype float16 \
--enforce-eager \
--load-in-low-bit sym_int4 \
--max-model-len 8192 \
--max-num-batched-tokens 10000 \
--max-num-seqs 256 \
--block-size 8 \
--tensor-parallel-size 2    # Change this to 1 for single card serving...

For testing, you can use the vllm_online_benchmark.py script in the container...
Change this line to your model path: https://github.com/intel-analytics/ipex-llm/blob/main/docker/llm/serving/xpu/docker/vllm_online_benchmark.py#L435

And change this line https://github.com/intel-analytics/ipex-llm/blob/main/docker/llm/serving/xpu/docker/vllm_online_benchmark.py#L459 to the prompt.

For instance:
PROMPT = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a world class A.I. model designed to generate high quality, reliable and true responses.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTell me about yourself<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

@gc-fu
Contributor

gc-fu commented Oct 18, 2024

Besides, to test the performance of the vLLM engine, we recommend using vllm_online_benchmark.py or benchmark_vllm_throughput.py

@HumerousGorgon
Author

Is that testing the inference performance or the text generation performance? I can't seem to get the vllm_online_benchmark working, just keeps throwing 404 errors on the server as far as I can tell.

My issue is with text generation speed. The script you provided did gain an extra token per second, but as far as I've seen things, adding an extra GPU has just slowed things down overall.

@gc-fu
Contributor

gc-fu commented Oct 20, 2024

Is that testing the inference performance or the text generation performance? I can't seem to get the vllm_online_benchmark working, just keeps throwing 404 errors on the server as far as I can tell.

My issue is with text generation speed. The script you provided did gain an extra token per second, but as far as I've seen things, adding an extra GPU has just slowed things down overall.

This test measures the text generation performance. As far as I can tell, OpenWebUI just sends requests to the vLLM OpenAI API server, which is exactly what vllm_online_benchmark.py does.

The 404 error might be caused by:

  1. A proxy. Try setting export no_proxy="127.0.0.1,localhost"
  2. How vllm_online_benchmark.py is started. Use the command python3 vllm_online_benchmark.py Llama-3.1 1, and also apply the two modifications I mentioned in my previous comment. A quick sanity check is sketched below.
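
A minimal sketch of that check (assumes the server is on port 8000):

export no_proxy="127.0.0.1,localhost"
# The model id returned here must match the name passed to vllm_online_benchmark.py (e.g. Llama-3.1)
curl http://localhost:8000/v1/models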

@glorysdj
Contributor

glorysdj commented Oct 21, 2024

Is that testing the inference performance or the text generation performance? I can't seem to get the vllm_online_benchmark working, just keeps throwing 404 errors on the server as far as I can tell.

My issue is with text generation speed. The script you provided did gain an extra token per second, but as far as I've seen things, adding an extra GPU has just slowed things down overall.

Adding a GPU benefits the computation but also brings additional communication overhead. For 7B/8B/9B LLM models, the time saved computing the next token is not enough to outweigh the communication overhead, so next-token latency will be slightly slower on dual cards than on a single card, while the 1st token will be a lot faster. The overall slowdown you mentioned is what you observe with batch=1, short inputs, and next-token latency accounting for a large portion of overall performance. Once you increase the batch (i.e. serve multiple requests in parallel) or use longer inputs, you will find that overall throughput is better on dual cards than on a single card. BTW, for 7B/8B/9B LLM models, if you don't need to serve very long inputs or need faster 1st-token latency, we recommend deploying one instance per card for the LLM service; by deploying multiple instances, we retain the next-token latency advantage of a single card and get 2X throughput (a rough sketch of launching one instance per card follows below). For a ~14B model we recommend two Arc A770s, and for a ~33B model we recommend four Arc A770s.
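
For the one-instance-per-card setup, something along these lines could work (a rough sketch, not from the quickstart; it assumes ZE_AFFINITY_MASK can be used to pin each instance to one GPU, and reuses the $model / $served_model_name variables from your start script):

# Instance on GPU 0, port 8000 (ZE_AFFINITY_MASK usage here is an assumption)
ZE_AFFINITY_MASK=0 python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name --model $model \
  --port 8000 --device xpu --dtype float16 --enforce-eager \
  --load-in-low-bit fp8 --tensor-parallel-size 1 &

# Instance on GPU 1, port 8001
ZE_AFFINITY_MASK=1 python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name --model $model \
  --port 8001 --device xpu --dtype float16 --enforce-eager \
  --load-in-low-bit fp8 --tensor-parallel-size 1 &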

@HumerousGorgon
Author

Thank you very much for this advice.

I did notice something very strange though. I decided to spin up a Llama.cpp instance using the ipex-llm guides and noticed SIGNIFICANTLY faster generation, which was in fact utilising both cards to generate. Inference speed was not as fast as vLLM, however text generation was the fastest I had seen overall.

Why might this be happening? I can achieve incredible results with llama.cpp but vLLM, which should be faster, is performing at 4-5 tokens per second for text generation, with the same model loaded.

@glorysdj
Contributor

glorysdj commented Oct 21, 2024

Hi @HumerousGorgon, what's the difference between the inference and text generation you mention here? Do you observe data like the below? Or are there any other metrics that help you differentiate between inference and text generation?

# Single card
first token: 100.4761297517689
next token: 15.190188343637834
# Multi card
first token: 69.21667874485138
next token: 21.78218797010891

@HumerousGorgon
Author

I'm going to run those benchmarks now to see whether my performance is in line with the numbers here. My problem is not with inference, it's with the generation of text. With vLLM it was extremely slow, a couple of words every second (4-5t/s) but with Llama.cpp and the exact same model, I was seeing it generate whole sentences in a second.

@ACupofAir
Contributor

We use the same method, with OpenWebUI as the front end, to deploy the llama3.1-8b model. The text generation speed in the web interface is very fast, and we get a response immediately after sending a question. The specific test values show about 40t/s on one card, while on two cards it stably reaches more than 45t/s. For a 128-length prompt, the test results are as follows:

  • one card: (screenshot of test results)
  • two cards: (screenshot of test results)

@HumerousGorgon
Author

Okay, something very wrong is happening with my vLLM instance, because I am seeing NOWHERE near those numbers. You are getting roughly 10x the performance I am. Could you share your script to start it?

@ACupofAir
Contributor

ACupofAir commented Oct 23, 2024

We have updated our openwebui+vllm-serving workflow here. The Docker start script is here; please update the docker image before starting it. The frontend and backend startup scripts are as follows. Note: change <api-key> to any string and <vllm-host-ip> to your host IPv4 address:

vLLM Serving start script:

#!/bin/bash
model="/llm/models/Meta-Llama-3.1-8B-Instruct"
served_model_name="Meta-Llama-3.1-8B-Instruct"

export CCL_WORKER_COUNT=2
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1

export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0

source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit fp8 \
  --max-model-len 2048 \
  --max-num-batched-tokens 4000 \
  --max-num-seqs 12 \
  --api-key <api-key> \
  --tensor-parallel-size 2

open-webui start script:

#!/bin/bash
export DOCKER_IMAGE=ghcr.io/open-webui/open-webui:main
export CONTAINER_NAME=open-webui

docker rm -f $CONTAINER_NAME

docker run -itd \
           -p 3000:8080 \
           -e OPENAI_API_KEY=<api-key> \
           -e OPENAI_API_BASE_URL=http://<vllm-host-ip>:8000/v1 \
           -v open-webui:/app/backend/data \
           --name $CONTAINER_NAME \
           --restart always $DOCKER_IMAGE  
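
Once both containers are up, a quick check that the endpoint and API key are wired correctly (a small sketch; uses the same placeholders as above):

curl -H "Authorization: Bearer <api-key>" http://<vllm-host-ip>:8000/v1/models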

@HumerousGorgon
Author

Hey there.

I used this script exactly as you had put it, updated my container to the latest version...
3-4 tokens per second for text generation. I am starting to wonder if this is an issue with Docker rather than an issue with vLLM.
I'm going to configure a VM with the GPUs passed through and see if I can fix this issue.

@HumerousGorgon
Author

(screenshot)

As you can see, even with the newest branch, this is just flat out not working the way everyone else's is. I'm using unRAID as my host machine. There is one difference: in order to get my GPUs working, I have to pass two environment variables:
-e OverrideGpuAddressSpace=48, -e NEOReadDebugKeys=1

This is the only fundamental difference here. I'm starting to lose my mind.
I'm also wondering whether my Xeon E5-2695v4 is limiting the GPUs?
I have reBAR enabled and GPU acceleration works in other areas, such as video encoding.

@liu-shaojun
Contributor

Hi @HumerousGorgon

We recommend setting up your environment with Ubuntu 22.04 and Kernel 6.5, and installing the Intel-i915-dkms driver for optimal performance. Once you have Ubuntu OS set up, please follow the steps below to install Kernel 6.5:

export VERSION="6.5.0-35"
sudo apt-get install -y linux-headers-$VERSION-generic
sudo apt-get install -y linux-image-$VERSION-generic
sudo apt-get install -y linux-modules-$VERSION-generic # may not be needed
sudo apt-get install -y linux-modules-extra-$VERSION-generic

After installing, you can configure GRUB to use the new kernel by running the following:

sudo sed -i "s/GRUB_DEFAULT=.*/GRUB_DEFAULT=\"1> $(echo $(($(awk -F\' '/menuentry / {print $2}' /boot/grub/grub.cfg \
| grep -no $VERSION | sed 's/:/\n/g' | head -n 1)-2)))\"/" /etc/default/grub

Then, update GRUB and reboot your system:

sudo update-grub
sudo reboot
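
After the reboot, you can confirm the new kernel is active (a quick check, assuming VERSION=6.5.0-35 as above):

uname -r    # expect something like 6.5.0-35-generic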

For detailed instructions on installing the driver, you can follow this link:
https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md#for-linux-kernel-65

Please feel free to reach out if you have any questions!

@glorysdj
Contributor

Please also make sure "re-sizeable BAR support" and "above 4G MMIO" are enabled.

@HumerousGorgon
Author

(screenshot)

Okay, so I set up an Ubuntu host using 22.04 and the 6.5 Kernel. I also used the i915-dkms driver. I'm now seeing 15 tokens per second, which is 3x the speed of my previous config, but still 3x slower than the numbers reported here.

I'm truly at a loss here and unsure of how to go on.

@liu-shaojun
Contributor

It might be related to the CPU/GPU frequency. You can try adjusting the CPU/GPU frequency to see if it has any impact.

For CPU frequency, you can use sudo cpupower frequency-info to check the frequency range, and then raise the minimum frequency using sudo cpupower frequency-set -d 3.8GHz. In our case, the output of sudo cpupower frequency-info was:

analyzing CPU 30:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 30
  CPUs which need to have their frequency coordinated by software: 30
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 800 MHz - 4.50 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 3.80 GHz and 4.50 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 3.80 GHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: yes

We set the CPU frequency to 3.8GHz for optimal performance:

sudo cpupower frequency-set -d 3.8GHz
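
If the frequency still drops under load, switching the CPU governor to performance may also help (a small sketch; cpupower must be installed):

sudo cpupower frequency-set -g performance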

For GPU, you can set the frequency using the following commands:

sudo xpu-smi config -d 0 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 1 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 2 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 3 -t 0 --frequencyrange 2400,2400

Let us know if adjusting the frequencies helps improve the performance!
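
The applied range can be verified afterwards (a small sketch; device ids follow xpu-smi discovery):

sudo xpu-smi stats -d 0 | grep -i frequency
sudo xpu-smi stats -d 1 | grep -i frequency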

@HumerousGorgon
Author

We use the same method, with OpenWebUI as the front end, to deploy the llama3.1-8b model. The text generation speed in the web interface is very fast, and we get a response immediately after sending a question. The specific test values show about 40t/s on one card, while on two cards it stably reaches more than 45t/s. For a 128-length prompt, the test results are as follows:

  • one card: (screenshot of test results)
  • two cards: (screenshot of test results)

So I went ahead and purchased a WHOLE ENTIRE NEW SYSTEM
11600kf, Z590 Vision D, 32GB of 3200MHz C16 RAM, same 2 Arc A770's...
11 tokens per second on text generation!
What am I doing wrong?! 6.5 kernel, rebar enabled, everything down to the letter with what's been shown. Is it the fact that I'm using the straight Llama-3.1 model from Meta, no quantisation? I have no idea anymore....

@HumerousGorgon
Author

Also, removing tensor parallel results in 22 tokens per second, so I'm still receiving half of the performance that other group members are experiencing.

@HumerousGorgon
Author

And finally, setting the CPU frequency to 3.8GHz did, for a single second, push the output speed to 40 tokens per second, but it then dropped back down to 20 tokens per second.

@HumerousGorgon
Author

Finally finally, I changed the governor to performance; still no change.

@ACupofAir
Contributor

(Quoting the vLLM serving and open-webui start scripts from my earlier comment above.)

After starting vLLM serving according to the above method, you can directly use /llm/vllm_online_benchmark.py in docker for performance testing; you should use this script to measure accurate TPS. If you start serving according to the above script, you only need to enter the docker container and execute python vllm_online_benchmark.py <model> <max_seq> [input_length] [output_length] in the /llm directory. For example:

python vllm_online_benchmark.py Meta-Llama-3.1-8B-Instruct 8 1024 512

What's more, you need to make sure the port in vllm_online_benchmark.py at line 427, LLM_URLS = [f"http://localhost:{PORT}/v1/completions" for PORT in [8000]], is the same one used in your vLLM serving start script, and the served_model_name in the vLLM serving start script should be the same as the model file basename. In your script, you need to change served_model_name to llama-3-1-instruct to make vllm_online_benchmark.py work.

...
model="/llm/models/llama-3-1-instruct"
served_model_name="llama-3-1-instruct"
...

@glorysdj
Contributor

And finally, setting cpu frequency to 3.8GHz did for a single second set the output speed to 40 tokens per second, but it then dropped back down to 20 tokens per second

Have you set the GPU frequency using the following commands:
sudo xpu-smi config -d 0 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 1 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 2 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 3 -t 0 --frequencyrange 2400,2400

@HumerousGorgon
Author

Is xpu-smi supported for Arc series GPUs?

@liu-shaojun
Contributor

Is xpu-smi supported for Arc series GPUs?

Yes, you can follow this guide https://github.com/intel/xpumanager/blob/master/doc/smi_install_guide.md to install xpu-smi.
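
Once installed, both Arc cards should show up in discovery (a quick check; the output format matches the device table in the env-check output above):

sudo xpu-smi discovery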

@HumerousGorgon
Author

Hello everyone!

Okay.. after reinstalling with Ubuntu 22.04.4, installing Kernel 6.5, installing the DKMS drivers, and manually changing the CPU and GPU frequencies, I'm finally getting the correct performance. Using a single GPU I get 40-42 tokens per second, and with dual GPUs I get 38-41 tokens per second. The drop in performance with the second GPU is likely just a link speed difference between them, which is nothing.

Thank you everyone for your help during this. I appreciate your patience and extended support!

@hkvision
Contributor

Closing this issue. Feel free to tell us if you have further problems!
