
Uncontrolled memory growth in custom backend #9576

Open
5 tasks done
sappjw opened this issue Oct 3, 2024 · 5 comments
Comments

@sappjw

sappjw commented Oct 3, 2024

What happened?

I wrote a custom backend. I'm using it to open a file, operate on the data, remove most of it from the Dataset using .isel, open the next file, concatenate, and repeat. I noticed that the memory used by the process grew significantly over time even though the size of the Dataset stayed approximately the same. I was able to reproduce the problem without most of this complexity.

I repeatedly created a dummy Dataset with 25 Variables and observed object counts with objgraph after each iteration. Variable instances keep increasing, even though I del the Dataset after creating it. I think this suggests that something in xarray is not releasing the Dataset.

Variable               75       +25
ReferenceType        8511        +4
lock                   46        +4
dict                21283        +2
KeyedRef               22        +2
SerializableLock        9        +2
list                23256        +1
set                  1873        +1
method               1893        +1
SeedSequence            4        +1
---
Variable              100       +25
ReferenceType        8515        +4
lock                   50        +4
dict                21285        +2
KeyedRef               24        +2
SerializableLock       11        +2
list                23257        +1
set                  1874        +1
method               1894        +1
SeedSequence            5        +1

I picked a random Variable that was not released and printed the reference chain graph.

import random
objgraph.show_chain(objgraph.find_backref_chain(random.choice(objgraph.by_type('Variable')), objgraph.is_proper_module), filename='graph.png')

(reference chain graph image: graph_variable)

What did you expect to happen?

I expected the memory used for the Dataset to be released and garbage-collected. I expected the memory in use to plateau instead of grow.
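This plateau expectation can be checked in isolation, without xarray, by counting live instances of a type across iterations. The sketch below is a rough, illustrative approximation of what objgraph's counting does; the `count_instances` helper and `Thing` class are hypothetical names, not part of any library:

```python
import gc

def count_instances(cls):
    """Count live instances of cls (roughly what objgraph.by_type counts)."""
    return sum(1 for o in gc.get_objects() if type(o) is cls)

class Thing:
    pass

baseline = count_instances(Thing)
things = [Thing() for _ in range(25)]
grown = count_instances(Thing)      # baseline + 25 while references exist

del things
gc.collect()
settled = count_instances(Thing)    # back to baseline once released

print(grown - baseline, settled - baseline)  # 25 0
```

If something were still holding a reference after the del, `settled` would stay above `baseline`, which is the pattern the objgraph output above shows for Variable.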

Minimal Complete Verifiable Example

import os
os.system("mamba install --yes objgraph")

import io

import numpy as np
import xarray as xr
from xarray.backends import BackendArray, BackendEntrypoint
from xarray.backends.common import AbstractDataStore
from xarray.backends.file_manager import CachingFileManager
from xarray.backends.locks import SerializableLock, ensure_lock
from xarray.backends.store import StoreBackendEntrypoint
from xarray.core.indexing import (
    IndexingSupport,
    explicit_indexing_adapter,
)
from xarray.core.utils import Frozen, close_on_error


# This backend creates a random 500MB variable and stores it into a Dataset 25 times
class memtest_DataStore(AbstractDataStore):
    """Store for binary data"""

    def __init__(
        self,
        filename,
        lock=None,
        num_bytes=500_000_000,
        **kwargs,
    ):
        if lock is None:
            lock = SerializableLock()
        self.lock = ensure_lock(lock)
        # Create a lock for the file manager
        self._fm_lock = SerializableLock()
        self._manager = CachingFileManager(
            self._open_file,
            filename,
            lock=self._fm_lock,
            kwargs=kwargs,
        )

        self.num_bytes = num_bytes
        self._num_vars = 25
        self._rng = np.random.default_rng()
        self._xr_obj = None

    def _open_file(self, filename, **kwargs):
        # Generate ~500 MB of random float64 data and expose it as an
        # in-memory file (an ndarray is a valid bytes-like initializer)
        b = self._rng.random(self.num_bytes // 8, dtype=np.float64)
        file_obj = io.BytesIO(b, **kwargs)

        return file_obj

    def _create_dataset(self):
        with self._manager.acquire_context(needs_lock=True) as file_obj:
            file_obj.seek(0)
            fltarr = np.frombuffer(file_obj.getvalue(), np.float64)

        xr_obj = xr.Dataset()
        # Create coordinates
        xr_obj["index"] = (
            ("index",),
            range(len(fltarr)),
        )
        # Create several variables so we can see the change
        for i in range(self._num_vars):
            xr_obj[f"var{i}"] = (
                ("index",),
                fltarr,
            )

        return xr_obj

    @property
    def ds(self):
        if self._xr_obj is None:  # explicit None check; a Dataset's truthiness depends on its contents
            self._xr_obj = self._create_dataset()

        return self._xr_obj

    def open_store_variable(self, name, var):
        return self.ds[name].variable

    def get_variables(self):
        return Frozen(self.ds.variables)

    def get_attrs(self):
        return Frozen(self.ds.attrs)

    def get_dimensions(self):
        return Frozen(self.ds.dims)

    def close(self):
        self._manager.close()


class memtest_backend(BackendEntrypoint):
    description = "Memory test backend"

    def open_dataset(
        self,
        filename_or_obj,
        *,
        drop_variables=None,
        lock=None,
        num_bytes=500_000_000,
    ):
        store = memtest_DataStore(
            filename_or_obj,
            lock=lock,
            num_bytes=num_bytes,
        )

        store_entrypoint = StoreBackendEntrypoint()
        with close_on_error(store):
            ds = store_entrypoint.open_dataset(
                store,
                drop_variables=drop_variables,
            )

        return ds

# Loop to test memory growth
# This results in about 10GB of additional memory used by the end of the process
import xarray as xr
import objgraph

objgraph.show_growth(limit=1)
for i in range(10):
    ds = xr.open_dataset(i, engine=memtest_backend)
    # Just to test that something's still holding on to the data
    del ds
    objgraph.show_growth()
    print('---')

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

function    18610    +18610
list                 15403    +12798
function             25617     +7007
dict                 13365     +6005
tuple                13360     +5284
CompositeUnit         2954     +2954
PrefixUnit            1526     +1526
ReferenceType         4284     +1181
cell                  4966     +1112
type                  2397      +792
getset_descriptor     3374      +628
---
Variable                         50       +25
ReferenceType                  4298       +14
builtin_function_or_method     2537       +10
lock                             34        +4
dict                          13367        +2
KeyedRef                         19        +2
SerializableLock                  7        +2
list                          15404        +1
set                             952        +1
method                          369        +1
---
Variable               75       +25
ReferenceType        4302        +4
lock                   38        +4
dict                13369        +2
KeyedRef               21        +2
SerializableLock        9        +2
list                15405        +1
set                   953        +1
method                370        +1
SeedSequence            4        +1
---
Variable              100       +25
ReferenceType        4306        +4
lock                   42        +4
dict                13371        +2
KeyedRef               23        +2
SerializableLock       11        +2
list                15406        +1
set                   954        +1
method                371        +1
SeedSequence            5        +1
---
Variable              125       +25
ReferenceType        4310        +4
lock                   46        +4
dict                13373        +2
KeyedRef               25        +2
SerializableLock       13        +2
list                15407        +1
set                   955        +1
method                372        +1
SeedSequence            6        +1
---
Variable              150       +25
ReferenceType        4314        +4
lock                   50        +4
dict                13375        +2
KeyedRef               27        +2
SerializableLock       15        +2
list                15408        +1
set                   956        +1
method                373        +1
SeedSequence            7        +1
---
Variable              175       +25
ReferenceType        4318        +4
lock                   54        +4
dict                13377        +2
KeyedRef               29        +2
SerializableLock       17        +2
list                15409        +1
set                   957        +1
method                374        +1
SeedSequence            8        +1
---
Variable              200       +25
ReferenceType        4322        +4
lock                   58        +4
dict                13379        +2
KeyedRef               31        +2
SerializableLock       19        +2
list                15410        +1
set                   958        +1
method                375        +1
SeedSequence            9        +1
---
Variable              225       +25
ReferenceType        4326        +4
lock                   62        +4
dict                13381        +2
KeyedRef               33        +2
SerializableLock       21        +2
list                15411        +1
set                   959        +1
method                376        +1
SeedSequence           10        +1
---
Variable              250       +25
ReferenceType        4330        +4
lock                   66        +4
dict                13383        +2
KeyedRef               35        +2
SerializableLock       23        +2
list                15412        +1
set                   960        +1
method                377        +1
SeedSequence           11        +1
---

Anything else we need to know?

This crashes the Binder notebook instance since it uses so much memory.

Environment

INSTALLED VERSIONS

commit: None
python: 3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 18:08:52) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 5.14.0-505.el9.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2024.9.0
pandas: 2.2.3
numpy: 1.26.4
scipy: None
netCDF4: 1.7.1
pydap: None
h5netcdf: 1.2.0
h5py: 3.11.0
zarr: None
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: 2023.9.3
distributed: 2023.9.3
matplotlib: 3.9.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.9.2
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.2.1
conda: None
pytest: None
mypy: None
IPython: 8.16.1
sphinx: None

@sappjw sappjw added bug needs triage Issue that has not been reviewed by xarray team member labels Oct 3, 2024
@sappjw sappjw changed the title Memory growth in custom backend Uncontrolled memory growth in custom backend Oct 3, 2024
@max-sixty
Collaborator

That does look like a well-formed issue, thanks @sappjw .

I had a ten-minute look through the code, as I'm not familiar with it. It looks like we maintain a REF_COUNTS dict which increments, and which I think is supposed to decrement when the file is closed.

To the extent you want to explore more — it would be interesting to see whether that count is incrementing without decrementing as you open the file on each loop. And then why it's not decrementing — does the file need to have .close called explicitly rather than just going out of scope...
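The bookkeeping described above can be sketched with a plain dict. This is a simplified illustration of the increment-on-open / decrement-on-close pattern, not xarray's actual CachingFileManager code; the `acquire`/`release` names are hypothetical:

```python
# Hypothetical sketch of reference-count bookkeeping, similar in spirit
# to a REF_COUNTS dict keyed by file (not xarray's actual implementation).
REF_COUNTS: dict = {}

def acquire(key):
    """Opening a file increments its count."""
    REF_COUNTS[key] = REF_COUNTS.get(key, 0) + 1

def release(key):
    """Closing a file decrements its count; at zero, the entry is removed."""
    REF_COUNTS[key] -= 1
    if REF_COUNTS[key] == 0:
        del REF_COUNTS[key]  # entry gone -> underlying file can be closed

acquire("file-0")
acquire("file-0")
release("file-0")
release("file-0")
print("file-0" in REF_COUNTS)  # False; a leak would leave the entry behind
```

The symptom being described would correspond to `acquire` running on each loop iteration while the matching `release` never fires, so entries (and everything they keep alive) accumulate.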

@sappjw
Author

sappjw commented Oct 3, 2024

Thanks. Inserting a ds.close() before the del ds doesn't change the results (I did have to update the code above slightly to read file_obj.getvalue() instead of file_obj.getbuffer()).

@max-sixty
Collaborator

OK. Would be interesting to see REF_COUNTS / feel free to explore more on what might be holding the reference...

@TomNicholas TomNicholas added topic-backends and removed needs triage Issue that has not been reviewed by xarray team member labels Oct 3, 2024
@sappjw
Author

sappjw commented Oct 4, 2024

If I put a del self._manager in the close() method of memtest_DataStore, the memory growth goes away. Do you think this is a good way to handle the issue? I modeled this on the scipy backend, which doesn't free _manager either.
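The shape of this workaround can be sketched in a standalone form. Below, `FakeManager`, `GLOBAL_CACHE`, and `Store` are hypothetical stand-ins (real xarray uses CachingFileManager and a module-level file cache; this is only an illustration of why dropping the manager reference in close() lets the store be reclaimed):

```python
import gc
import weakref

GLOBAL_CACHE = {}  # stand-in for a module-level cache that keeps managers alive

class FakeManager:
    """Simplified stand-in for a caching file manager (hypothetical)."""
    def __init__(self, store):
        self.store = store              # back-reference to the store
        GLOBAL_CACHE[id(self)] = self   # the cache keeps the manager alive

    def release(self):
        GLOBAL_CACHE.pop(id(self), None)

class Store:
    """Simplified stand-in for the custom DataStore."""
    def __init__(self):
        self._manager = FakeManager(self)

    def close(self):
        # Without this, GLOBAL_CACHE -> manager -> store keeps the whole
        # store (and any Variables it references) alive indefinitely.
        self._manager.release()
        del self._manager

store = Store()
ref = weakref.ref(store)
store.close()
del store
gc.collect()
print(ref() is None)  # True: the store was reclaimed after close()
```

Without the close() call, the weakref would stay alive, because the chain rooted in the global cache still reaches the store — the same pattern as the Variable counts growing in the MVCE.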

@max-sixty
Collaborator

Very possibly, but I'm unfortunately a long way from the expert on this code, hence my rather basic debugging so far.

Others will hopefully know more. If that suggestion helps and you can put a small PR + test together, that will very likely get some good traction.
