GPU not being utilized on Windows #3806

Closed · 4 tasks done
gsuuon opened this issue Oct 27, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@gsuuon

gsuuon commented Oct 27, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

GPU usage goes up when layers are offloaded with -ngl, and inference performance is decent. I expect to see around 170 ms/tok.

Current Behavior

GPU memory usage goes up, but GPU activity stays at 0 and only CPU usage increases. I'm getting around 2500 ms/tok.

Environment and Context

Windows 11 - 3070 RTX

Attempting to run codellama-13b-instruct.Q6_K.gguf

I ran a git bisect, which identified 017efe899d8 as the first bad commit. I see roughly a 15x drop in performance between ff5a3f0 and 017efe8, from ~170 ms/tok to ~2500 ms/tok.
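For anyone who wants to repeat the bisect, the workflow was roughly the standard one (the endpoints shown here are just the two commits above; at each step you rebuild with the cmake command from the repro section and time a short generation):

    # start a bisect between the known-good and known-bad commits
    git bisect start
    git bisect bad 017efe899d8     # slow build, ~2500 ms/tok
    git bisect good ff5a3f0        # fast build, ~170 ms/tok
    # rebuild and re-test each commit git checks out, then mark it:
    #   git bisect good   (or)   git bisect bad
    git bisect reset               # return to the original checkout when done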

Steps to Reproduce

  1. build with cmake .. -DLLAMA_CUBLAS=ON
  2. run .\bin\Release\server.exe -m ..\models\codellama-13b-instruct.Q6_K.gguf -c 4096 -ngl 24 (full sequence sketched below)
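For completeness, the full sequence looks roughly like this (assuming the default Visual Studio generator on Windows; the model path is just my local layout):

    mkdir build
    cd build
    # configure with cuBLAS enabled
    cmake .. -DLLAMA_CUBLAS=ON
    # build the Release binaries
    cmake --build . --config Release
    # start the server with 24 layers offloaded to the GPU
    .\bin\Release\server.exe -m ..\models\codellama-13b-instruct.Q6_K.gguf -c 4096 -ngl 24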
@gsuuon gsuuon added the bug Something isn't working label Oct 27, 2023
@atonalfreerider

atonalfreerider commented Oct 27, 2023

EDIT: I have tested all the way back to https://github.com/ggerganov/llama.cpp/releases/tag/b1116 (August 29) and I'm experiencing the same behavior. I'm running main.exe with the -m <path-to-model.gguf> and -p <prompt here> flags.


I am also experiencing the same behavior with the latest build release https://github.com/ggerganov/llama.cpp/releases/tag/b1429

I see low to zero GPU utilization with CUDA 12.2 on an RTX 2070, and zero utilization on a GTX 1070.
More details here: SciSharp/LLamaSharp#189

benchmark here:

C:\Users\johnb\Desktop\llamacpp-bin\llama-b1429-bin-win-cublas-cu12.2.0-x64>llama-bench.exe
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070 SUPER, compute capability 7.5
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CUDA       |  99 | pp 512     |  1165.85 ± 60.52 |
| llama 7B mostly Q4_0           |   3.56 GiB |     6.74 B | CUDA       |  99 | tg 128     |     76.31 ± 0.11 |

@atonalfreerider

I may have found a partial answer. First I updated my NVIDIA driver. When I ran a prompt, I immediately noticed that the dedicated GPU memory filled up almost to the max. My RAM was maxed out as usual.

I ran a 14 GB model and a 4 GB model. It appears that the 4 GB model fits into GPU memory, but the 14 GB model clearly does not and spills over into system RAM. As a result, I got faster outputs, with up to 30% GPU utilization. Diagram with results below:

[Image: performance results diagram (perf2)]

@AsakusaRinne
Contributor

So the key point is that when n_gpu_layers is too large, no error is thrown; instead, the GPU is totally ignored?

@atonalfreerider

So the key point is that when n_gpu_layers is too large, no error is thrown; instead, the GPU is totally ignored?

Yes, I've reproduced this behavior on multiple machines. It appears that overflowing the GPU memory causes 100% of the activity to shift to the CPU.

@slaren
Collaborator

slaren commented Oct 28, 2023

This is a feature of the NVIDIA drivers under Windows: they allow allocating more memory than is actually available. As far as I know there is no way to disable this. Linux should not be affected.
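A rough way to sanity-check this before picking -ngl is to compare the model size against what the card actually has free, and to watch utilization while generating; plain nvidia-smi is enough, nothing llama.cpp-specific:

    # how much VRAM is used/free before loading the model
    nvidia-smi --query-gpu=name,memory.used,memory.free,memory.total --format=csv
    # refresh utilization and memory once per second while the server generates
    nvidia-smi -l 1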

@slaren
Collaborator

slaren commented Oct 31, 2023

Looks like this can be disabled now:
https://nvidia.custhelp.com/app/answers/detail/a_id/5490
https://old.reddit.com/r/LocalLLaMA/comments/17kl8gu/psa_with_nvidia_driver_56401_its_now_possible_to/

@gsuuon
Author

gsuuon commented Oct 31, 2023

@atonalfreerider I think these are separate issues - I get poor performance even with -ngl set to a low value, with only 4.2 of 8.0 GB allocated. I only see the issue from 017efe8 onwards; if I go back to 017efe899d8^ (ff5a3f0) I see the GPU being utilized with good performance.

These are all with .\bin\Release\server.exe -m ..\models\codellama-13b-instruct.Q6_K.gguf -c 4096

017efe8 -ngl 10

llama_print_timings:      sample time =    19.18 ms /    21 runs   (    0.91 ms per token,  1095.18 tokens per second)
llama_print_timings: prompt eval time =  1507.32 ms /    33 tokens (   45.68 ms per token,    21.89 tokens per second)
llama_print_timings:        eval time = 66379.60 ms /    20 runs   ( 3318.98 ms per token,     0.30 tokens per second)

[Image: GPU utilization screenshot]

017efe8 -ngl 22

llama_print_timings:      sample time =    57.21 ms /    61 runs   (    0.94 ms per token,  1066.30 tokens per second)
llama_print_timings: prompt eval time =  1183.16 ms /    33 tokens (   35.85 ms per token,    27.89 tokens per second)
llama_print_timings:        eval time = 127016.99 ms /    60 runs   ( 2116.95 ms per token,     0.47 tokens per second)

[Image: GPU utilization screenshot]

ff5a3f0 -ngl 22

llama_print_timings:      sample time =    48.12 ms /    53 runs   (    0.91 ms per token,  1101.41 tokens per second)
llama_print_timings: prompt eval time =  1212.03 ms /    33 tokens (   36.73 ms per token,    27.23 tokens per second)
llama_print_timings:        eval time =  8715.19 ms /    52 runs   (  167.60 ms per token,     5.97 tokens per second)

[Image: GPU utilization screenshot]

There's spiky GPU activity at 017efe8, but it stays in the single digits, whereas at ff5a3f0 usage stays around 40%.

@gsuuon
Author

gsuuon commented Nov 2, 2023

It looks like this issue is specifically with LLAMA_NATIVE; I get normal performance again with -DLLAMA_NATIVE=OFF. Should that be the default?
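For reference, the two builds I'm comparing differ only in that one flag; everything else is the cmake setup from my original repro:

    # build that shows the slowdown for me
    cmake .. -DLLAMA_CUBLAS=ON
    # build that restores normal performance
    cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_NATIVE=OFF
    cmake --build . --config Release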

@gsuuon
Author

gsuuon commented Nov 2, 2023

Resolved by #3906. Not sure what's going on with the low GPU usage; maybe the CPU was just the bottleneck.

@gsuuon gsuuon closed this as completed Nov 2, 2023