Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Smaller (in-memory) and lazy edge lists #8

Merged
merged 10 commits into from
Sep 26, 2023

Conversation

johandahlberg
Copy link
Contributor

Description

First of all, sorry that this is such a huge PR, but this touches on a lot different things.

This introduces:

  • smaller in memory edge lists (~60% of the original dataframe size in my benchmarks)
  • less memory intensive concatenation (~23% of the original memory usage)
  • lazy data frame representation of the edge list using polars LazyFrame. This allows us to carry out many operations on the edge list without loading all of it into memory, which should improve scalability in many cases.
    • the lazy data frames are now used, for checking the edge list size, filtering PixelDataset instances, concatenation, and finding individual components in the edgelist

These changes are backward compatible, but there will be a performance hit (higher memory usage, and longer runtimes) when working with .pxl-files generated with older pixelator versions.

Fixes: https://linear.app/pixelgen-technologies/issue/EXE-1062/reduce-edgelist-memory-usage-in-concatenation

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

This has been tested by manually running concatenation, and manually trying out some of these operations in a quarto notebook.

PR checklist:

  • This comment contains a description of changes (with reason).
  • My code follows the style guidelines of this project
  • Make sure your code lints, is well-formatted and type checks
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • If you've added a new stage - have you followed the pipeline CLI conventions in the USAGE guide
  • README.md is updated
  • If a new tool or package is included, update poetry.lock, the conda recipe and cited it properly
  • Include any new authors/contributors
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and documentation and corrected any misspellings
  • Usage Documentation is updated
  • If you are doing a release, or a significant change to the code, update CHANGELOG.md

Copy link
Contributor

@ambarrio ambarrio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sent you my comments, we already talked a bit over Slack. Let's resolve them and merge on Monday perhaps. So we can also try the 40M limit as well.

Filter markers from a RegionByCountsDataFrame based on how many counts
available for that marker

:param df: dataframe to filter
:param min_region_counts: minumum number of counts for the marker (exlusive),
:param min_marker_counts: minmum number of counts for the marker (exclusive),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minmum -> minimum

What does it mean "exclusive" here? If it means "marker > min_marker_counts", can we document it that way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or what is the reason to document it that way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it means marker > min_marker_counts. I can change it.

return markers_per_pixel


def filter_by_region_counts(
df: RegionByCountsDataFrame, min_region_counts: int = 5
) -> RegionByCountsDataFrame:
"""
"""Filter by counts in the region.

Filter regions from a RegionByCountsDataFrame based on
how many counts are in the region

:param df: dataframe to filter
:param min_region_counts: minumum number of counts in region (exlusive),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here with the "exclusive" keyword as in L81

src/pixelator/graph/utils.py Show resolved Hide resolved
@@ -127,8 +133,9 @@ def create_node_markers_counts(
:param normalization: selects a normalization method to apply when
building neighborhoods
:returns: a pd.DataFrame with the antibody counts per node
:rtype: pd.DataFrame
:raises: Assertion error if no 'markers' attribute is found on the vertices
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:raises: Assertion error -> :raises AssertionError:

I think both are accepted. After searching we mix a lot in the repo, but I would adopt the second I guess. Just referring to the guidance here:
https://sphinx-rtd-tutorial.readthedocs.io/en/latest/docstrings.html#the-sphinx-docstring-format

:raises [ErrorType]: [ErrorDescription]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@@ -69,6 +71,9 @@ def input_edgelist_fixture(tmp_path, edgelist_with_communities: pd.DataFrame):
index=False,
)
assert len(edgelist_with_communities["component"].unique()) == 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe document the fixture as it is an input edgelist with an unique component?

tests/graph/test_graph_utils.py Show resolved Hide resolved
assert metadata == dataset_new.metadata

assert_frame_equal(
polarization_scores, dataset_new.polarization, check_dtype=False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the dtypes not going to agree here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why False?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the reason for this is that we were using the pyarrow extension when reading these data frames (and then not converting them). I dropped this now, since especially for these smaller data frames it doesn't make that much of a difference in read performance/memory usage. We might go back to using pyarrow more extensively in the future, but I think we should wait for the rest of the ecosystem to catch up first.

}
assert result.dtypes.to_dict() == expected

_ = _enforce_edgelist_types(result)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this second _enforce_edgelist_types call in the test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was debugging a thing here and forgot to remove it. I'll clean it up.

src/pixelator/pixeldataset.py Show resolved Hide resolved
) -> PixelDataset:
"""Create a new instance of PixelDataset from the provided underlying objects.

:param adata: an instance of `AnnData`
:param edgelist: an edgelist as a `pd.DataFrame`
:param edgelist: an edgelist as a `pd.DataFrame`, defaults to None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't default to None. It is just optional

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just delete the defaults to None if we need it not to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Good catch. I just missed setting this back since I went a bit back and forth in the implementation of this.

@johandahlberg johandahlberg force-pushed the feature/exe-1062-smaller-in-memory-edgelists branch from d8d6e8f to 35ddbce Compare September 26, 2023 08:33
Copy link
Contributor

@ambarrio ambarrio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I leave my comments but I think this is ready to go actually, so you have my approval.

The only comment to mark is the sorting when comparing frames

@@ -153,7 +162,7 @@ def test_create_node_markers_counts_k_eq_1(pentagram_graph):
],
columns=["A", "B", "C", "D", "E"],
)
expected.columns.name = "markers"
expected = _create_df_with_expected_types(expected)
assert_frame_equal(result, expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we need to .sort_index() in these ones and the ones below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the methods do not guarantee that the index will be returned sorted, so for tests to work I sort them. I'll make a comment about in the code.

from pixelator.pixeldataset import (
FileBasedPixelDatasetBackend,
ObjectBasedPixelDatasetBackend,
PixelDataset,
PixelFileCSVFormatSpec,
PixelFileFormatSpec,
PixelFileParquetFormatSpec,
_enforce_edgelist_types,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we export a private method? Or should we just excapsulate this in a public one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or are we including an explicit test on the public method? Ok, I see it is being called when edgelist is retrieved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a pragmatic decision on my part here. In general we aren't testing private methods. But in this case I thought it would useful since it's a rather crucial part of data transformation. I'll add a comment about this in the test to explain my reasoning.

@johandahlberg johandahlberg merged commit bc8f6a9 into dev Sep 26, 2023
13 checks passed
@johandahlberg johandahlberg deleted the feature/exe-1062-smaller-in-memory-edgelists branch September 26, 2023 12:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants