Smaller (in-memory) and lazy edge lists #8

johandahlberg · 2023-09-20T13:29:03Z

Description

First of all, sorry that this is such a huge PR, but it touches on a lot of different things.

This introduces:

- Smaller in-memory edge lists (~60% of the original dataframe size in my benchmarks).
- Less memory-intensive concatenation (~23% of the original memory usage).
- A lazy data frame representation of the edge list using polars LazyFrame. This allows us to carry out many operations on the edge list without loading all of it into memory, which should improve scalability in many cases.
  - The lazy data frames are now used for checking the edge list size, filtering PixelDataset instances, concatenation, and finding individual components in the edge list.

These changes are backward compatible, but there will be a performance hit (higher memory usage, and longer runtimes) when working with .pxl-files generated with older pixelator versions.

Fixes: https://linear.app/pixelgen-technologies/issue/EXE-1062/reduce-edgelist-memory-usage-in-concatenation

Type of change

Please delete options that are not relevant.

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

This has been tested by manually running concatenation, and manually trying out some of these operations in a quarto notebook.

PR checklist:

ambarrio

I've sent you my comments, and we already talked a bit over Slack. Let's resolve them and perhaps merge on Monday, so we can also try the 40M limit.

ambarrio · 2023-09-20T13:54:36Z

src/pixelator/analysis/colocalization/prepare.py

    Filter markers from a RegionByCountsDataFrame based on how many counts
    available for that marker

    :param df: dataframe to filter
-    :param min_region_counts: minumum number of counts for the marker (exlusive),
+    :param min_marker_counts: minmum number of counts for the marker (exclusive),


minmum -> minimum

What does "exclusive" mean here? If it means "marker > min_marker_counts", can we document it that way?

Or what is the reason to document it that way?

Yeah, it means marker > min_marker_counts. I can change it.
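
To make the strict inequality discussed above concrete, here is a minimal sketch of a filter documented that way. The function name `filter_by_marker_counts_sketch` and the use of per-column totals are assumptions for illustration; this is not the implementation in `prepare.py`.

```python
import pandas as pd


def filter_by_marker_counts_sketch(
    df: pd.DataFrame, min_marker_counts: int = 0
) -> pd.DataFrame:
    """Keep marker columns whose total count exceeds a threshold.

    :param df: dataframe with one column per marker
    :param min_marker_counts: keep a marker only if its total count is
                              strictly greater than this value
                              (marker > min_marker_counts)
    :returns: the dataframe restricted to the retained markers
    """
    totals = df.sum(axis=0)
    return df.loc[:, totals > min_marker_counts]


counts = pd.DataFrame({"A": [3, 2], "B": [0, 0], "C": [5, 1]})
# "A" totals exactly 5, so a threshold of 5 drops it (strict inequality).
filtered = filter_by_marker_counts_sketch(counts, min_marker_counts=5)
```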

ambarrio · 2023-09-20T13:58:17Z

src/pixelator/analysis/colocalization/prepare.py

    return markers_per_pixel


 def filter_by_region_counts(
    df: RegionByCountsDataFrame, min_region_counts: int = 5
 ) -> RegionByCountsDataFrame:
-    """
+    """Filter by counts in the region.
+
    Filter regions from a RegionByCountsDataFrame based on
    how many counts are in the region

    :param df: dataframe to filter
    :param min_region_counts: minumum number of counts in region (exlusive),


Same here with the "exclusive" keyword as in L81

ambarrio · 2023-09-20T14:31:31Z

src/pixelator/graph/utils.py

@@ -127,8 +133,9 @@ def create_node_markers_counts(
    :param normalization: selects a normalization method to apply when
                          building neighborhoods
    :returns: a pd.DataFrame with the antibody counts per node
+    :rtype: pd.DataFrame
+    :raises: Assertion error if no 'markers' attribute is found on the vertices


:raises: Assertion error -> :raises AssertionError:

I think both are accepted. After searching, I see we mix the two a lot in the repo, but I would adopt the second one. I'm just referring to the guidance here:
https://sphinx-rtd-tutorial.readthedocs.io/en/latest/docstrings.html#the-sphinx-docstring-format

:raises [ErrorType]: [ErrorDescription]

Good catch!
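
For reference, a docstring following the `:raises [ErrorType]: [ErrorDescription]` form could look like the sketch below. The function `require_markers_sketch` and its dict argument are hypothetical and stand in for the real graph code.

```python
def require_markers_sketch(vertex_attributes: dict) -> list:
    """Return the markers stored on a vertex.

    :param vertex_attributes: mapping of attribute names to values
    :returns: the list of markers
    :rtype: list
    :raises AssertionError: if no 'markers' attribute is found
    """
    assert "markers" in vertex_attributes, "no 'markers' attribute found"
    return list(vertex_attributes["markers"])
```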

ambarrio · 2023-09-20T14:52:05Z

tests/graph/test_graph.py

@@ -69,6 +71,9 @@ def input_edgelist_fixture(tmp_path, edgelist_with_communities: pd.DataFrame):
        index=False,
    )
    assert len(edgelist_with_communities["component"].unique()) == 1


Maybe document the fixture, since it is an input edgelist with a single unique component?

ambarrio · 2023-09-20T20:28:16Z

tests/test_pixeldataset.py

+        assert metadata == dataset_new.metadata
+
+        assert_frame_equal(
+            polarization_scores, dataset_new.polarization, check_dtype=False


Are the dtypes not going to agree here?

So, the reason for this is that we were using the pyarrow extension when reading these data frames (and then not converting them). I dropped this now, since especially for these smaller data frames it doesn't make that much of a difference in read performance/memory usage. We might go back to using pyarrow more extensively in the future, but I think we should wait for the rest of the ecosystem to catch up first.
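
The effect of `check_dtype=False` can be seen in a small sketch like the one below. It uses `int64` versus `int32` as a stand-in for the pyarrow-backed versus numpy-backed dtypes mentioned above, which is an assumption made to keep the example free of extra dependencies.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

left = pd.DataFrame({"score": pd.Series([1, 2, 3], dtype="int64")})
right = pd.DataFrame({"score": pd.Series([1, 2, 3], dtype="int32")})

# The values match, so the comparison passes once dtype checking is relaxed.
assert_frame_equal(left, right, check_dtype=False)

# With the default check_dtype=True the same comparison raises.
try:
    assert_frame_equal(left, right)
    strict_comparison_passed = True
except AssertionError:
    strict_comparison_passed = False
```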

ambarrio · 2023-09-21T07:27:52Z

tests/test_pixeldataset.py

+    }
+    assert result.dtypes.to_dict() == expected
+
+    _ = _enforce_edgelist_types(result)


Why this second _enforce_edgelist_types call in the test?

I was debugging a thing here and forgot to remove it. I'll clean it up.

ambarrio · 2023-09-21T13:45:50Z

src/pixelator/pixeldataset.py

    ) -> PixelDataset:
        """Create a new instance of PixelDataset from the provided underlying objects.

        :param adata: an instance of `AnnData`
-        :param edgelist: an edgelist as a `pd.DataFrame`
+        :param edgelist: an edgelist as a `pd.DataFrame`, defaults to None


It doesn't default to None. It is just optional

Should we just delete the "defaults to None" if that isn't the case?

Yes. Good catch. I just missed setting this back since I went a bit back and forth in the implementation of this.

ambarrio

I've left my comments, but I think this is actually ready to go, so you have my approval.

The only comment worth noting is the one about sorting when comparing frames.

ambarrio · 2023-09-26T10:10:29Z

tests/graph/test_graph_utils.py

@@ -153,7 +162,7 @@ def test_create_node_markers_counts_k_eq_1(pentagram_graph):
        ],
        columns=["A", "B", "C", "D", "E"],
    )
-    expected.columns.name = "markers"
+    expected = _create_df_with_expected_types(expected)
    assert_frame_equal(result, expected)


Why don't we need to .sort_index() in these ones and the ones below?

Some of the methods do not guarantee that the index will be returned sorted, so for the tests to work I sort them. I'll add a comment about it in the code.
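
A small sketch of why the sorting matters when comparing frames: the data below is made up, and the two frames hold the same rows in a different index order.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

expected = pd.DataFrame({"A": [1, 2, 3]}, index=["n1", "n2", "n3"])
# Same rows, but returned in a different index order.
result = pd.DataFrame({"A": [3, 1, 2]}, index=["n3", "n1", "n2"])

# Comparing directly fails because the index order differs.
try:
    assert_frame_equal(result, expected)
    unsorted_comparison_passed = True
except AssertionError:
    unsorted_comparison_passed = False

# Sorting both sides makes the comparison independent of row order.
assert_frame_equal(result.sort_index(), expected.sort_index())
```

`assert_frame_equal` also accepts `check_like=True`, which ignores the order of the index and columns, as an alternative to sorting explicitly.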

ambarrio · 2023-09-26T11:51:18Z

tests/test_pixeldataset.py

 from pixelator.pixeldataset import (
    FileBasedPixelDatasetBackend,
    ObjectBasedPixelDatasetBackend,
    PixelDataset,
    PixelFileCSVFormatSpec,
    PixelFileFormatSpec,
    PixelFileParquetFormatSpec,
+    _enforce_edgelist_types,


Do we export a private method? Or should we just encapsulate this in a public one?

Or are we including an explicit test on the public method? Ok, I see it is being called when edgelist is retrieved.

This was a pragmatic decision on my part. In general we aren't testing private methods, but in this case I thought it would be useful since it's a rather crucial part of the data transformation. I'll add a comment about this in the test to explain my reasoning.
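
To show why such a type-enforcing step is worth testing directly, here is a hedged sketch of a helper of that kind and the checks one might run on it. `enforce_types_sketch`, the column names, and the target dtypes are all assumptions; they are not taken from `_enforce_edgelist_types`.

```python
import pandas as pd


def enforce_types_sketch(edgelist: pd.DataFrame) -> pd.DataFrame:
    """Cast repetitive string columns to category and counts to a small int.

    Hypothetical stand-in for a type-enforcing helper; the column names and
    target dtypes are assumptions, not the ones pixelator uses.
    """
    return edgelist.astype(
        {"marker": "category", "component": "category", "count": "uint16"}
    )


raw = pd.DataFrame(
    {
        "marker": ["CD3", "CD4", "CD8", "CD45"] * 2500,
        "component": ["C1", "C2"] * 5000,
        "count": [1, 2, 3, 4] * 2500,
    }
)
typed = enforce_types_sketch(raw)

# deep=True includes the memory used by the string objects themselves.
raw_bytes = raw.memory_usage(deep=True).sum()
typed_bytes = typed.memory_usage(deep=True).sum()
```

Checking both the resulting dtypes and the memory footprint covers the two properties that matter here: the types are what downstream code expects, and the frame is actually smaller.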

johandahlberg requested review from fbdtemme and ambarrio September 20, 2023 13:29

ambarrio reviewed Sep 21, 2023

johandahlberg added 8 commits September 26, 2023 08:33

Add fsspec as dependency

b0174a2

Lazy edgelists, and methods. Smaller dfs.

00dcb6c

Update the changelog

0e168a2

Added missing fastparquet dependency

96101a4

Ignore mypy type on reading with polars

0320922

Minor documentation and liniting conf. fixes

1855d66

Do not use the pyarrow extension to read df's

a48f199

Minor clean-ups

35ddbce

johandahlberg force-pushed the feature/exe-1062-smaller-in-memory-edgelists branch from d8d6e8f to 35ddbce Compare September 26, 2023 08:33

Ignore mypy type error

815c5fc

ambarrio approved these changes Sep 26, 2023

Clearifying commments and fixes in doc strings

5b8f43f

johandahlberg merged commit bc8f6a9 into dev Sep 26, 2023
13 checks passed

johandahlberg deleted the feature/exe-1062-smaller-in-memory-edgelists branch September 26, 2023 12:58
