Memory leak in local laplacian filter Manual scheduling #8399

Open
mshafiei opened this issue Aug 27, 2024 · 9 comments

mshafiei commented Aug 27, 2024

Hi,

I'm observing that the GPU runs out of memory when I call the local Laplacian filter in a loop. It's reproducible with the following code snippet. When I enable only Mullapudi2016 and disable Manual, I no longer observe the issue.


from local_laplacian import local_laplacian
from local_laplacian_Mullapudi2016 import local_laplacian_Mullapudi2016
import halide.imageio
import numpy as np
import sys
import timeit
import halide as hl


def llf(input_path, levels, alpha, beta, output_path):
    timing_iterations = 10

    print("Reading from %s ..." % input_path)
    input_buf_u8 = halide.imageio.imread(input_path)
    assert input_buf_u8.dtype == np.uint8
    # Convert from uint8 [0..255] to the full uint16 range [0..65535]
    input_buf = input_buf_u8.astype(np.uint16) * 257
    h = input_buf.shape[1]
    w = input_buf.shape[2]
    output_buf = np.empty([3, h, w], dtype=input_buf.dtype)
    tests = {
        "Manual": local_laplacian
        # "Mullapudi2016": local_laplacian_Mullapudi2016,
    }

    for name, fn in tests.items():
        print("Running %s... " % name, end="")
        t = timeit.Timer(lambda: fn(input_buf, levels, alpha / (levels - 1), beta, output_buf))
        avg_time_sec = t.timeit(number=timing_iterations) / timing_iterations
        print("time: %fms" % (avg_time_sec * 1e3))

    output_buf_u8 = (output_buf // 257).astype(np.uint8)

    print("Saving to %s ..." % output_path)
    halide.imageio.imwrite(output_path, output_buf_u8)


def main():

    input_path = sys.argv[1]
    levels = int(sys.argv[2])
    alpha = float(sys.argv[3])
    beta = float(sys.argv[4])
    output_path = sys.argv[5]

    for i in range(100):
        llf(input_path, levels, alpha, beta, output_path)

    print("Success!")
    sys.exit(0)

if __name__ == "__main__":
    main()

mshafiei changed the title from "Out of memory error when running generators in a loop" to "Memory leak in local laplacian filter Manual scheduling" on Aug 27, 2024

@abadams (Member) commented Sep 4, 2024

@shoaibkamil

shoaibkamil self-assigned this Sep 9, 2024

@shoaibkamil (Contributor) commented:

I can repro this behavior running on macOS with Metal. Investigating.

@mshafiei (Author) commented:

It's also happening for the blur app and the bilateral grid. Is the root cause in the generator compilation step?
Other information that might be helpful: I'm passing the host-cuda-profile target to add_halide_library to enable GPU scheduling on an RTX 3070 with NVIDIA driver 535.183.01 and CUDA 12.1.

@abadams (Member) commented Oct 16, 2024

It looks like the generated extension code makes no attempt to free any GPU allocations made by the pipeline. It does set host dirty and copy back to host, though, so I'm not sure what the intention was here. @steven-johnson, is this just an oversight? Should the PyHalideBuffer destructor be calling device_free?

template<int dimensions>
struct PyHalideBuffer {
    // Must allocate at least 1, even if d=0
    static constexpr int dims_to_allocate = (dimensions < 1) ? 1 : dimensions;

    Py_buffer py_buf;
    halide_dimension_t halide_dim[dims_to_allocate];
    halide_buffer_t halide_buf;
    bool py_buf_needs_release = false;

    bool unpack(PyObject *py_obj, int py_getbuffer_flags, const char *name) {
        return Halide::PythonRuntime::unpack_buffer(py_obj, py_getbuffer_flags, name, dimensions, py_buf, halide_dim, halide_buf, py_buf_needs_release);
    }

    ~PyHalideBuffer() {
        if (py_buf_needs_release) {
            PyBuffer_Release(&py_buf);
        }
    }

    PyHalideBuffer() = default;
    PyHalideBuffer(const PyHalideBuffer &other) = delete;
    PyHalideBuffer &operator=(const PyHalideBuffer &other) = delete;
    PyHalideBuffer(PyHalideBuffer &&other) = delete;
    PyHalideBuffer &operator=(PyHalideBuffer &&other) = delete;
};

}  // namespace

namespace Halide::PythonExtensions {

namespace {

const char* const local_laplacian_kwlist[] = {
  "input",
  "levels",
  "alpha",
  "beta",
  "output",
  nullptr
};

}  // namespace

// local_laplacian
PyObject *local_laplacian(PyObject *module, PyObject *args, PyObject *kwargs) {
  PyObject* py_input;
  int py_levels;
  float py_alpha;
  float py_beta;
  PyObject* py_output;
  if (!PyArg_ParseTupleAndKeywords(args, kwargs, "OiffO", (char**)local_laplacian_kwlist
    , &py_input
    , &py_levels
    , &py_alpha
    , &py_beta
    , &py_output
  )) {
    PyErr_Format(PyExc_ValueError, "Internal error");
    return nullptr;
  }
  PyHalideBuffer<3> b_input;
  PyHalideBuffer<3> b_output;
  if (!b_input.unpack(py_input, 0, local_laplacian_kwlist[0])) return nullptr;
  if (!b_output.unpack(py_output, PyBUF_WRITABLE, local_laplacian_kwlist[4])) return nullptr;

  b_input.halide_buf.set_host_dirty();
  int result;
  Py_BEGIN_ALLOW_THREADS
  result = local_laplacian(
    &b_input.halide_buf,
    py_levels,
    py_alpha,
    py_beta,
    &b_output.halide_buf
  );
  Py_END_ALLOW_THREADS
  if (result == 0) result = halide_copy_to_host(nullptr, &b_output.halide_buf);
  if (result != 0) {
    #ifndef HALIDE_PYTHON_EXTENSION_OMIT_ERROR_AND_PRINT_HANDLERS
    PyErr_Format(PyExc_RuntimeError, "Halide Runtime Error: %d", result);
    #else
    PyErr_Format(PyExc_ValueError, "Halide error %d", result);
    #endif  // HALIDE_PYTHON_EXTENSION_OMIT_ERROR_AND_PRINT_HANDLERS
    return nullptr;
  }

  Py_INCREF(Py_None);
  return Py_None;
}
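
For illustration only, and not part of the generated code above: under a CUDA schedule the pipeline attaches device allocations to b_input.halide_buf and b_output.halide_buf, and nothing in local_laplacian() above releases them before returning. An explicit cleanup on the success path would look roughly like the sketch below; the discussion that follows moves this responsibility into the PyHalideBuffer destructor instead.

  // Hypothetical cleanup sketch (not present in the generated extension):
  // release the GPU allocations the pipeline attached to the wrapped NumPy arrays.
  if (b_input.halide_buf.device) {
    halide_device_free(nullptr, &b_input.halide_buf);
  }
  if (b_output.halide_buf.device) {
    halide_device_free(nullptr, &b_output.halide_buf);
  }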

@steven-johnson (Contributor) commented:

> Should the PyHalideBuffer destructor be calling device_free?

If we do that, don't we risk freeing a device allocation that might be in use by a shared buffer allocation (e.g. via device_crop or similar)? Is it possible that we just don't free all the PyHalideBuffers?

@abadams (Member) commented Oct 16, 2024

It looks like the halide_buffer_t is being created right there from a numpy array, so I don't think it's possible that anything aliases with it. Or is it possible to pass some sort of wrapper of Halide::Runtime::Buffer?

@steven-johnson (Contributor) commented:

OK, I will take a look

@steven-johnson (Contributor) commented:

OK, yeah, I think an explicit call to halide_device_free() is likely needed in the PyHalideBuffer dtor; let me do some testing first.
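
A minimal sketch of that change, assuming the wrapper may own whatever device allocation the pipeline attached to its halide_buffer_t (the eventual fix in #8439, referenced below, may differ in detail):

    ~PyHalideBuffer() {
        if (halide_buf.device) {
            // Release the GPU allocation the pipeline made for this buffer;
            // nullptr selects the default user context.
            halide_device_free(nullptr, &halide_buf);
        }
        if (py_buf_needs_release) {
            PyBuffer_Release(&py_buf);
        }
    }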

@steven-johnson (Contributor) commented:

I think #8439 is what we need, please give it a try

steven-johnson added a commit that referenced this issue on Oct 21, 2024: …#8439)
Co-authored-by: Andrew Adams <andrew.b.adams@gmail.com>