
NVML does not return the correct struct for the latest CUDA #48

Open · wants to merge 1 commit into master

Conversation

@fostiropoulos commented Jul 27, 2023

For my current system:

| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |

The processes running on CUDA are read incorrectly. The returned struct contains an additional, undocumented size_t element; when it is ignored, the parsed response is malformed.
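
A minimal sketch of the size mismatch as I read it (assuming 64-bit alignment; "index" is only a placeholder name for the undocumented element):

import ctypes
from ctypes import c_uint, c_ulonglong, c_ssize_t

class ProcessInfoV3(ctypes.Structure):
    # Layout pynvml currently assumes for nvmlDeviceGetComputeRunningProcesses_v3
    _fields_ = [
        ("pid", c_uint),
        ("usedGpuMemory", c_ulonglong),
        ("gpuInstanceId", c_uint),
        ("computeInstanceId", c_uint),
    ]

class ProcessInfo535(ctypes.Structure):
    # Layout the 535 driver appears to write: the same fields plus the extra element
    _fields_ = [
        ("pid", c_uint),
        ("usedGpuMemory", c_ulonglong),
        ("gpuInstanceId", c_uint),
        ("computeInstanceId", c_uint),
        ("index", c_ssize_t),
    ]

print(ctypes.sizeof(ProcessInfoV3))   # 24
print(ctypes.sizeof(ProcessInfo535))  # 32, so every entry after the first is read at the wrong offset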

The following code illustrates the error:

from collections import namedtuple
from ctypes import *
import time

import pynvml
import ray
from pynvml.nvml import (
    _nvmlGetFunctionPointer,
    _PrintableStructure,
    nvmlDeviceGetComputeRunningProcesses,
)
import torch
import struct

Process = namedtuple(
    "Process", ["pid", "usedGpuMemory", "gpuInstanceId", "computeInstanceId"]
)


def _remote_fn():
    torch.randn(100).to("cuda")
    time.sleep(10)


def run_bug():
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    # this initializes and creates an additional process (for added difficulty)
    a = torch.randn(100).to("cuda")
    for i in range(10):
        (
            ray.remote(
                num_gpus=0.001,
                num_cpus=0.001,
                max_calls=1,
                max_retries=0,
            )(_remote_fn)
            .options(name="x")
            .remote()
        )
    # wait for all processes to be allocated
    time.sleep(4)
    procs = nvmlDeviceGetComputeRunningProcesses(handle)

    class c_nvmlProcessInfo_t(_PrintableStructure):
        _fields_ = [
            ("pid", c_uint),
            ("usedGpuMemory", c_ulonglong),
            ("gpuInstanceId", c_uint),
            ("computeInstanceId", c_uint),
            ("index", c_ssize_t),
        ]
        _fmt_ = {
            "usedGpuMemory": "%d B",
        }

    pynvml.nvml.c_nvmlProcessInfo_t = c_nvmlProcessInfo_t

    procs_fixed = nvmlDeviceGetComputeRunningProcesses(handle)

    def _parse_result(procs):
        return "\n".join(str(p) for p in procs)

    return _parse_result(procs), _parse_result(procs_fixed)


def get_procs_analysis():
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    fn = _nvmlGetFunctionPointer("nvmlDeviceGetComputeRunningProcesses_v3")
    # Reference
    # https://docs.nvidia.com/deploy/nvml-api/structnvmlProcessInfo__t.html#structnvmlProcessInfo__t
    # although the field order differs (pid, usedGpuMemory, gpuInstanceId, computeInstanceId).
    # NOTE: I am not sure which of the last two fields is which, as they are identical
    # on my system and cannot be distinguished by debugging.
    expr = "IQIIn"
    byte_size_proc = struct.calcsize(expr)
    # ORIGINAL
    proc_array = c_ubyte * (byte_size_proc * 100)  # enough for 100 processes
    c_procs = proc_array()

    # make the call again
    c_count = c_uint(100)

    ret = fn(handle, byref(c_count), c_procs)
    return_bytes = bytes(c_procs)

    def _parse_bytes(idx):
        args = struct.unpack(
            expr,
            return_bytes[idx * byte_size_proc : (idx + 1) * byte_size_proc],
        )
        return Process(*args[:-1])

    formatted = []
    for i in range(100):
        p = _parse_bytes(i)
        if p.pid == 0 and p.usedGpuMemory == 0 and p.gpuInstanceId == 0:
            break
        formatted.append(str(p))

    return '\n'.join(formatted)


if __name__ == "__main__":
    procs, procs_fixed = run_bug()
    print(procs)
    print(procs_fixed)
    print(get_procs_analysis())

Output:

{'pid': 1183425, 'usedGpuMemory': 251658240, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295}
{'pid': 0, 'usedGpuMemory': 1280123, 'gpuInstanceId': 251658240, 'computeInstanceId': 0}
{'pid': 4294967295, 'usedGpuMemory': 0, 'gpuInstanceId': 1283310, 'computeInstanceId': 0}
{'pid': 251658240, 'usedGpuMemory': None, 'gpuInstanceId': 0, 'computeInstanceId': 0}
{'pid': 1283307, 'usedGpuMemory': 251658240, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295}
{'pid': 0, 'usedGpuMemory': 1283309, 'gpuInstanceId': 251658240, 'computeInstanceId': 0}
{'pid': 4294967295, 'usedGpuMemory': 0, 'gpuInstanceId': 1283311, 'computeInstanceId': 0}
{'pid': 251658240, 'usedGpuMemory': None, 'gpuInstanceId': 0, 'computeInstanceId': 0}
{'pid': 1283308, 'usedGpuMemory': 251658240, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295}
{'pid': 0, 'usedGpuMemory': 1283306, 'gpuInstanceId': 251658240, 'computeInstanceId': 0}
{'pid': 4294967295, 'usedGpuMemory': 0, 'gpuInstanceId': 1283313, 'computeInstanceId': 0}
{'pid': 251658240, 'usedGpuMemory': None, 'gpuInstanceId': 0, 'computeInstanceId': 0}

Expected:

{'pid': 1183425, 'usedGpuMemory': 251658240, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295, 'index': 0}
{'pid': 1280123, 'usedGpuMemory': 251658240, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295, 'index': 0}
{'pid': 1283310, 'usedGpuMemory': 251658240, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295, 'index': 0}
{'pid': 1283307, 'usedGpuMemory': 251658240, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295, 'index': 0}
{'pid': 1283309, 'usedGpuMemory': 251658240, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295, 'index': 0}
{'pid': 1283311, 'usedGpuMemory': 251658240, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295, 'index': 0}
{'pid': 1283308, 'usedGpuMemory': 251658240, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295, 'index': 0}
{'pid': 1283306, 'usedGpuMemory': 251658240, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295, 'index': 0}
{'pid': 1283313, 'usedGpuMemory': 251658240, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295, 'index': 0}
{'pid': 1283312, 'usedGpuMemory': 251658240, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295, 'index': 0}
{'pid': 1283315, 'usedGpuMemory': 251658240, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295, 'index': 0}
{'pid': 1283314, 'usedGpuMemory': 251658240, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295, 'index': 0}

@fostiropoulos (Author) commented Aug 2, 2023

Having tried the same code on a different machine with different GPUs, the error seems to be related to the GPU model. The code produces the bug on an RTX 2080 with NVLink, while the same code does not produce an error on a V100.

@wence- commented Aug 2, 2023

Thanks. This is due to an inadvertent ABI break in the 535 driver, which will be fixed in the next patch release.

@fostiropoulos (Author)
Thanks for clarifying. Should there be a check somewhere that raises an error for this particular driver version, or at least documentation of it?
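
Something like the following could serve as a client-side guard until the fixed driver ships. This is only a sketch using pynvml's nvmlSystemGetDriverVersion, and it flags all 535.x drivers since the exact fixed patch version is not known yet:

import warnings
import pynvml

def warn_if_affected_driver():
    # Warn when the installed driver is a 535.x release affected by the ABI break.
    pynvml.nvmlInit()
    version = pynvml.nvmlSystemGetDriverVersion()
    # Older pynvml releases return bytes, newer ones return str.
    if isinstance(version, bytes):
        version = version.decode()
    if version.split(".")[0] == "535":
        warnings.warn(
            f"Driver {version} may return a mismatched nvmlProcessInfo_t layout; "
            "process queries can be malformed (see this issue).",
            RuntimeWarning,
        )

warn_if_affected_driver()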

@erikhuck
@wence- A patch release for what? NVIDIA? CUDA? Or can the pynvml package be updated to fix this?

@erikhuck
This may be the same problem as my issue here: #50

@fostiropoulos (Author)
@erikhuck How I solved it was to uninstall driver version 535 and install the preceding release until the fixed version is released:

| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
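
For anyone who cannot downgrade, the monkey-patch from run_bug above can also be used as a stopgap. This is just a sketch based on the struct redefinition in this issue, not an official fix:

from ctypes import c_uint, c_ulonglong, c_ssize_t

import pynvml
from pynvml.nvml import _PrintableStructure

class _PatchedProcessInfo(_PrintableStructure):
    # Same redefinition as in run_bug above; "index" is a placeholder name
    # for the undocumented extra element written by the 535 driver.
    _fields_ = [
        ("pid", c_uint),
        ("usedGpuMemory", c_ulonglong),
        ("gpuInstanceId", c_uint),
        ("computeInstanceId", c_uint),
        ("index", c_ssize_t),
    ]
    _fmt_ = {"usedGpuMemory": "%d B"}

# Replace the struct pynvml uses before querying running processes.
pynvml.nvml.c_nvmlProcessInfo_t = _PatchedProcessInfo

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    print(proc)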
