-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Smaller (in-memory) and lazy edge lists #8
Smaller (in-memory) and lazy edge lists #8
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I sent you my comments, we already talked a bit over Slack. Let's resolve them and merge on Monday perhaps. So we can also try the 40M limit as well.
Filter markers from a RegionByCountsDataFrame based on how many counts | ||
available for that marker | ||
|
||
:param df: dataframe to filter | ||
:param min_region_counts: minumum number of counts for the marker (exlusive), | ||
:param min_marker_counts: minmum number of counts for the marker (exclusive), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minmum -> minimum
What does it mean "exclusive" here? If it means "marker > min_marker_counts", can we document it that way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or what is the reason to document it that way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it means marker > min_marker_counts
. I can change it.
return markers_per_pixel | ||
|
||
|
||
def filter_by_region_counts( | ||
df: RegionByCountsDataFrame, min_region_counts: int = 5 | ||
) -> RegionByCountsDataFrame: | ||
""" | ||
"""Filter by counts in the region. | ||
|
||
Filter regions from a RegionByCountsDataFrame based on | ||
how many counts are in the region | ||
|
||
:param df: dataframe to filter | ||
:param min_region_counts: minumum number of counts in region (exlusive), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here with the "exclusive" keyword as in L81
src/pixelator/graph/utils.py
Outdated
@@ -127,8 +133,9 @@ def create_node_markers_counts( | |||
:param normalization: selects a normalization method to apply when | |||
building neighborhoods | |||
:returns: a pd.DataFrame with the antibody counts per node | |||
:rtype: pd.DataFrame | |||
:raises: Assertion error if no 'markers' attribute is found on the vertices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:raises: Assertion error -> :raises AssertionError:
I think both are accepted. After searching we mix a lot in the repo, but I would adopt the second I guess. Just referring to the guidance here:
https://sphinx-rtd-tutorial.readthedocs.io/en/latest/docstrings.html#the-sphinx-docstring-format
:raises [ErrorType]: [ErrorDescription]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
@@ -69,6 +71,9 @@ def input_edgelist_fixture(tmp_path, edgelist_with_communities: pd.DataFrame): | |||
index=False, | |||
) | |||
assert len(edgelist_with_communities["component"].unique()) == 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe document the fixture as it is an input edgelist with an unique component?
tests/test_pixeldataset.py
Outdated
assert metadata == dataset_new.metadata | ||
|
||
assert_frame_equal( | ||
polarization_scores, dataset_new.polarization, check_dtype=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the dtype
s not going to agree here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why False?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, the reason for this is that we were using the pyarrow
extension when reading these data frames (and then not converting them). I dropped this now, since especially for these smaller data frames it doesn't make that much of a difference in read performance/memory usage. We might go back to using pyarrow
more extensively in the future, but I think we should wait for the rest of the ecosystem to catch up first.
tests/test_pixeldataset.py
Outdated
} | ||
assert result.dtypes.to_dict() == expected | ||
|
||
_ = _enforce_edgelist_types(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this second _enforce_edgelist_types
call in the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was debugging a thing here and forgot to remove it. I'll clean it up.
src/pixelator/pixeldataset.py
Outdated
) -> PixelDataset: | ||
"""Create a new instance of PixelDataset from the provided underlying objects. | ||
|
||
:param adata: an instance of `AnnData` | ||
:param edgelist: an edgelist as a `pd.DataFrame` | ||
:param edgelist: an edgelist as a `pd.DataFrame`, defaults to None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't default to None
. It is just optional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just delete the defaults to None if we need it not to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Good catch. I just missed setting this back since I went a bit back and forth in the implementation of this.
d8d6e8f
to
35ddbce
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I leave my comments but I think this is ready to go actually, so you have my approval.
The only comment to mark is the sorting when comparing frames
@@ -153,7 +162,7 @@ def test_create_node_markers_counts_k_eq_1(pentagram_graph): | |||
], | |||
columns=["A", "B", "C", "D", "E"], | |||
) | |||
expected.columns.name = "markers" | |||
expected = _create_df_with_expected_types(expected) | |||
assert_frame_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we need to .sort_index()
in these ones and the ones below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the methods do not guarantee that the index will be returned sorted, so for tests to work I sort them. I'll make a comment about in the code.
from pixelator.pixeldataset import ( | ||
FileBasedPixelDatasetBackend, | ||
ObjectBasedPixelDatasetBackend, | ||
PixelDataset, | ||
PixelFileCSVFormatSpec, | ||
PixelFileFormatSpec, | ||
PixelFileParquetFormatSpec, | ||
_enforce_edgelist_types, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we export a private method? Or should we just excapsulate this in a public one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or are we including an explicit test on the public method? Ok, I see it is being called when edgelist
is retrieved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a pragmatic decision on my part here. In general we aren't testing private methods. But in this case I thought it would useful since it's a rather crucial part of data transformation. I'll add a comment about this in the test to explain my reasoning.
Description
First of all, sorry that this is such a huge PR, but this touches on a lot different things.
This introduces:
LazyFrame
. This allows us to carry out many operations on the edge list without loading all of it into memory, which should improve scalability in many cases.PixelDataset
instances, concatenation, and finding individual components in the edgelistThese changes are backward compatible, but there will be a performance hit (higher memory usage, and longer runtimes) when working with .pxl-files generated with older pixelator versions.
Fixes: https://linear.app/pixelgen-technologies/issue/EXE-1062/reduce-edgelist-memory-usage-in-concatenation
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
This has been tested by manually running concatenation, and manually trying out some of these operations in a quarto notebook.
PR checklist: