GPU not being utilized on Windows #3806

Closed · 4 tasks done
gsuuon opened this issue Oct 27, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@gsuuon

gsuuon commented Oct 27, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

GPU usage goes up when layers are offloaded with -ngl, and inference performance is decent. I expect to see around 170 ms/tok.

Current Behavior

GPU memory usage goes up, but GPU activity stays at 0 and only CPU usage increases. I'm getting around 2500 ms/tok.

Environment and Context

Windows 11 - 3070 RTX

Attempting to run codellama-13b-instruct.Q6_K.gguf

I ran a git bisect, which identified 017efe899d8 as the first bad commit. I see roughly a 15x drop in performance between ff5a3f0 and 017efe8, from ~170 ms/tok to ~2500 ms/tok.
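For anyone who wants to repeat the bisect, the workflow was roughly the standard one (the endpoints shown here are just the two commits above; at each step you rebuild with the cmake command from the repro section and time a short generation):

    # start a bisect between the known-good and known-bad commits
    git bisect start
    git bisect bad 017efe899d8     # slow build, ~2500 ms/tok
    git bisect good ff5a3f0        # fast build, ~170 ms/tok
    # rebuild and re-test each commit git checks out, then mark it:
    #   git bisect good   (or)   git bisect bad
    git bisect reset               # return to the original checkout when done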

Steps to Reproduce

  1. build with cmake .. -DLLAMA_CUBLAS=ON
  2. run .\bin\Release\server.exe -m ..\models\codellama-13b-instruct.Q6_K.gguf -c 4096 -ngl 24 (full sequence sketched below)
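For completeness, the full sequence looks roughly like this (assuming the default Visual Studio generator on Windows; the model path is just my local layout):

    mkdir build
    cd build
    # configure with cuBLAS enabled
    cmake .. -DLLAMA_CUBLAS=ON
    # build the Release binaries
    cmake --build . --config Release
    # start the server with 24 layers offloaded to the GPU
    .\bin\Release\server.exe -m ..\models\codellama-13b-instruct.Q6_K.gguf -c 4096 -ngl 24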
@gsuuon gsuuon added the bug Something isn't working label Oct 27, 2023
@atonalfreerider

atonalfreerider commented Oct 27, 2023

EDIT: I have tested all the way back to https://github.com/ggerganov/llama.cpp/releases/tag/b1116 (August 29) and I'm experiencing the same behavior. I'm running main.exe with the -m <path-to-model.gguf> and -p <prompt here> flags.


I am also experiencing the same behavior with the latest build release https://github.com/ggerganov/llama.cpp/releases/tag/b1429

I see low to zero GPU utilization with CUDA 12.2 on an RTX 2070, and zero utilization on a GTX 1070.
More details here: SciSharp/LLamaSharp#189

benchmark here:

C:\Users\johnb\Desktop\llamacpp-bin\llama-b1429-bin-win-cublas-cu12.2.0-x64>llama-bench.exe
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070 SUPER, compute capability 7.5
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CUDA       |  99 | pp 512     |  1165.85 ± 60.52 |
| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CUDA       |  99 | tg 128     |     76.31 ± 0.11 |

@atonalfreerider

I may have found a partial answer. First I updated my NVIDIA driver. When I ran a prompt, I immediately noticed that the dedicated GPU memory filled up almost to the max. My RAM was maxed out as usual.

I ran a 14 GB model and a 4 GB model. It appears that the 4 GB model fits into GPU memory, but the 14 GB model clearly does not and spills over into system RAM. As a result, I got faster outputs, with up to 30% GPU utilization. Diagram with results below:

[Image: performance results diagram (perf2)]

@AsakusaRinne
Contributor

So the key point is that when n_gpu_layers is too large, no error is thrown; instead, the GPU is totally ignored?

@atonalfreerider

So the key point is that when n_gpu_layers is too large, no error is thrown; instead, the GPU is totally ignored?

Yes, I've reproduced this behavior on multiple machines. It appears that overflowing the GPU memory causes 100% of the activity to shift to the CPU.

@slaren
Collaborator

slaren commented Oct 28, 2023

This is a feature of the NVIDIA drivers under Windows: they allow allocating more memory than is actually available. As far as I know there is no way to disable this. Linux should not be affected.
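A rough way to sanity-check this before picking -ngl is to compare the model size against what the card actually has free, and to watch utilization while generating; plain nvidia-smi is enough, nothing llama.cpp-specific:

    # how much VRAM is used/free before loading the model
    nvidia-smi --query-gpu=name,memory.used,memory.free,memory.total --format=csv
    # refresh utilization and memory once per second while the server generates
    nvidia-smi -l 1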

@slaren
Collaborator

slaren commented Oct 31, 2023

Looks like this can be disabled now:
https://nvidia.custhelp.com/app/answers/detail/a_id/5490
https://old.reddit.com/r/LocalLLaMA/comments/17kl8gu/psa_with_nvidia_driver_56401_its_now_possible_to/

@gsuuon
Author

gsuuon commented Oct 31, 2023

@atonalfreerider I think these are separate issues - I get poor performance even with -ngl set to a low value, with only 4.2 of 8.0 GB allocated. I only see the issue from 017efe8 onwards; if I go back to 017efe899d8^ (ff5a3f0) I see the GPU being utilized with good performance.

These are all with .\bin\Release\server.exe -m ..\models\codellama-13b-instruct.Q6_K.gguf -c 4096

017efe8 -ngl 10

llama_print_timings:      sample time =    19.18 ms /    21 runs   (    0.91 ms per token,  1095.18 tokens per second)
llama_print_timings: prompt eval time =  1507.32 ms /    33 tokens (   45.68 ms per token,    21.89 tokens per second)
llama_print_timings:        eval time = 66379.60 ms /    20 runs   ( 3318.98 ms per token,     0.30 tokens per second)

[Image: GPU utilization screenshot]

017efe8 -ngl 22

llama_print_timings:      sample time =    57.21 ms /    61 runs   (    0.94 ms per token,  1066.30 tokens per second)
llama_print_timings: prompt eval time =  1183.16 ms /    33 tokens (   35.85 ms per token,    27.89 tokens per second)
llama_print_timings:        eval time = 127016.99 ms /    60 runs   ( 2116.95 ms per token,     0.47 tokens per second)

[Image: GPU utilization screenshot]

ff5a3f0 -ngl 22

llama_print_timings:      sample time =    48.12 ms /    53 runs   (    0.91 ms per token,  1101.41 tokens per second)
llama_print_timings: prompt eval time =  1212.03 ms /    33 tokens (   36.73 ms per token,    27.23 tokens per second)
llama_print_timings:        eval time =  8715.19 ms /    52 runs   (  167.60 ms per token,     5.97 tokens per second)

[Image: GPU utilization screenshot]

There's spiky GPU activity at 017efe8, but it stays in the single digits, whereas at ff5a3f0 usage stays around 40%.

@gsuuon
Author

gsuuon commented Nov 2, 2023

It looks like this issue is specifically with LLAMA_NATIVE; I get normal performance again with -DLLAMA_NATIVE=OFF. Should that be the default?
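For reference, the two builds I'm comparing differ only in that one flag; everything else is the cmake setup from my original repro:

    # build that shows the slowdown for me
    cmake .. -DLLAMA_CUBLAS=ON
    # build that restores normal performance
    cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_NATIVE=OFF
    cmake --build . --config Release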

@gsuuon
Author

gsuuon commented Nov 2, 2023

Resolved by #3906. Not sure what's going on with the low GPU usage; maybe the CPU was just the bottleneck.

@gsuuon gsuuon closed this as completed Nov 2, 2023