Memory leak in local laplacian filter Manual scheduling #8399

Open
mshafiei opened this issue Aug 27, 2024 · 9 comments

mshafiei commented Aug 27, 2024

Hi,

I'm observing that the GPU runs out of memory when I call the local Laplacian filter in a loop. It's reproducible with the following code snippet. When I enable only Mullapudi2016 and disable Manual, I no longer observe the issue.


from local_laplacian import local_laplacian
from local_laplacian_Mullapudi2016 import local_laplacian_Mullapudi2016
import halide.imageio
import numpy as np
import sys
import timeit
import halide as hl


def llf(input_path, levels, alpha, beta, output_path):
    timing_iterations = 10

    print("Reading from %s ..." % input_path)
    input_buf_u8 = halide.imageio.imread(input_path)
    assert input_buf_u8.dtype == np.uint8
    # Convert from uint8 [0..255] to the full uint16 range [0..65535]
    input_buf = input_buf_u8.astype(np.uint16) * 257
    h = input_buf.shape[1]
    w = input_buf.shape[2]
    output_buf = np.empty([3, h, w], dtype=input_buf.dtype)
    tests = {
        "Manual": local_laplacian
        # "Mullapudi2016": local_laplacian_Mullapudi2016,
    }

    for name, fn in tests.items():
        print("Running %s... " % name, end="")
        t = timeit.Timer(lambda: fn(input_buf, levels, alpha / (levels - 1), beta, output_buf))
        avg_time_sec = t.timeit(number=timing_iterations) / timing_iterations
        print("time: %fms" % (avg_time_sec * 1e3))

    output_buf_u8 = (output_buf // 257).astype(np.uint8)

    print("Saving to %s ..." % output_path)
    halide.imageio.imwrite(output_path, output_buf_u8)


def main():

    input_path = sys.argv[1]
    levels = int(sys.argv[2])
    alpha = float(sys.argv[3])
    beta = float(sys.argv[4])
    output_path = sys.argv[5]

    for i in range(100):
        llf(input_path, levels, alpha, beta, output_path)

    print("Success!")
    sys.exit(0)

if __name__ == "__main__":
    main()

mshafiei changed the title from "Out of memory error when running generators in a loop" to "Memory leak in local laplacian filter Manual scheduling" on Aug 27, 2024

@abadams (Member) commented Sep 4, 2024

@shoaibkamil

shoaibkamil self-assigned this Sep 9, 2024

@shoaibkamil (Contributor) commented:

I can repro this behavior running on macOS with Metal. Investigating.

@mshafiei (Author) commented:

It's also happening for the blur app and the bilateral grid. Is the root cause in the generator compilation step?
Other information that might be helpful: I'm passing the host-cuda-profile target to add_halide_library to enable GPU scheduling on an RTX 3070 with NVIDIA driver 535.183.01 and CUDA 12.1.

@abadams (Member) commented Oct 16, 2024

It looks like the generated extension code makes no attempt to free any GPU allocations made by the pipeline. It does set host dirty and copy back to host, though, so I'm not sure what the intention was here. @steven-johnson, is this just an oversight? Should the PyHalideBuffer destructor be calling device_free?

template<int dimensions>
struct PyHalideBuffer {
    // Must allocate at least 1, even if d=0
    static constexpr int dims_to_allocate = (dimensions < 1) ? 1 : dimensions;

    Py_buffer py_buf;
    halide_dimension_t halide_dim[dims_to_allocate];
    halide_buffer_t halide_buf;
    bool py_buf_needs_release = false;

    bool unpack(PyObject *py_obj, int py_getbuffer_flags, const char *name) {
        return Halide::PythonRuntime::unpack_buffer(py_obj, py_getbuffer_flags, name, dimensions, py_buf, halide_dim, halide_buf, py_buf_needs_release);
    }

    ~PyHalideBuffer() {
        if (py_buf_needs_release) {
            PyBuffer_Release(&py_buf);
        }
    }

    PyHalideBuffer() = default;
    PyHalideBuffer(const PyHalideBuffer &other) = delete;
    PyHalideBuffer &operator=(const PyHalideBuffer &other) = delete;
    PyHalideBuffer(PyHalideBuffer &&other) = delete;
    PyHalideBuffer &operator=(PyHalideBuffer &&other) = delete;
};

}  // namespace

namespace Halide::PythonExtensions {

namespace {

const char* const local_laplacian_kwlist[] = {
  "input",
  "levels",
  "alpha",
  "beta",
  "output",
  nullptr
};

}  // namespace

// local_laplacian
PyObject *local_laplacian(PyObject *module, PyObject *args, PyObject *kwargs) {
  PyObject* py_input;
  int py_levels;
  float py_alpha;
  float py_beta;
  PyObject* py_output;
  if (!PyArg_ParseTupleAndKeywords(args, kwargs, "OiffO", (char**)local_laplacian_kwlist
    , &py_input
    , &py_levels
    , &py_alpha
    , &py_beta
    , &py_output
  )) {
    PyErr_Format(PyExc_ValueError, "Internal error");
    return nullptr;
  }
  PyHalideBuffer<3> b_input;
  PyHalideBuffer<3> b_output;
  if (!b_input.unpack(py_input, 0, local_laplacian_kwlist[0])) return nullptr;
  if (!b_output.unpack(py_output, PyBUF_WRITABLE, local_laplacian_kwlist[4])) return nullptr;

  b_input.halide_buf.set_host_dirty();
  int result;
  Py_BEGIN_ALLOW_THREADS
  result = local_laplacian(
    &b_input.halide_buf,
    py_levels,
    py_alpha,
    py_beta,
    &b_output.halide_buf
  );
  Py_END_ALLOW_THREADS
  if (result == 0) result = halide_copy_to_host(nullptr, &b_output.halide_buf);
  if (result != 0) {
    #ifndef HALIDE_PYTHON_EXTENSION_OMIT_ERROR_AND_PRINT_HANDLERS
    PyErr_Format(PyExc_RuntimeError, "Halide Runtime Error: %d", result);
    #else
    PyErr_Format(PyExc_ValueError, "Halide error %d", result);
    #endif  // HALIDE_PYTHON_EXTENSION_OMIT_ERROR_AND_PRINT_HANDLERS
    return nullptr;
  }

  Py_INCREF(Py_None);
  return Py_None;
}
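
For illustration only, and not part of the generated code above: under a CUDA schedule the pipeline attaches device allocations to b_input.halide_buf and b_output.halide_buf, and nothing in local_laplacian() above releases them before returning. An explicit cleanup on the success path would look roughly like the sketch below; the discussion that follows moves this responsibility into the PyHalideBuffer destructor instead.

  // Hypothetical cleanup sketch (not present in the generated extension):
  // release the GPU allocations the pipeline attached to the wrapped NumPy arrays.
  if (b_input.halide_buf.device) {
    halide_device_free(nullptr, &b_input.halide_buf);
  }
  if (b_output.halide_buf.device) {
    halide_device_free(nullptr, &b_output.halide_buf);
  }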

@steven-johnson (Contributor) commented:

> Should the PyHalideBuffer destructor be calling device_free?

If we do that, don't we risk freeing a device allocation that might be in use by a shared buffer allocation (e.g. via device_crop or similar)? Is it possible that we just don't free all the PyHalideBuffers?

@abadams (Member) commented Oct 16, 2024

It looks like the halide_buffer_t is being created right there from a numpy array, so I don't think it's possible that anything aliases with it. Or is it possible to pass some sort of wrapper of Halide::Runtime::Buffer?

@steven-johnson (Contributor) commented:

OK, I will take a look

@steven-johnson (Contributor) commented:

OK, yeah, I think an explicit call to halide_device_free() is likely needed in the PyHalideBuffer dtor; let me do some testing first.
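
A minimal sketch of that change, assuming the wrapper may own whatever device allocation the pipeline attached to its halide_buffer_t (the eventual fix in #8439, referenced below, may differ in detail):

    ~PyHalideBuffer() {
        if (halide_buf.device) {
            // Release the GPU allocation the pipeline made for this buffer;
            // nullptr selects the default user context.
            halide_device_free(nullptr, &halide_buf);
        }
        if (py_buf_needs_release) {
            PyBuffer_Release(&py_buf);
        }
    }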

@steven-johnson (Contributor) commented:

I think #8439 is what we need, please give it a try

steven-johnson added a commit that referenced this issue on Oct 21, 2024: …#8439)
Co-authored-by: Andrew Adams <andrew.b.adams@gmail.com>