
refactor(MSstatsSummarize): Create MSstatsSummarizeWithSingleCore to reduce memory and runtime usage #126

Merged 5 commits into devel on Jul 26, 2024

Conversation

@tonywu1999 (Contributor) commented Jul 15, 2024

Motivation and Context

Using the dataset from the FragPipe-MSstats Nature Protocols paper, I discovered that sections of the MSstats code are inefficient in both memory and runtime.

In one case, when running `input_split = split(input, input$PROTEIN)` with a single core here, I saw available memory drop by 140 GB (observed with `free -g` in the terminal), which crashed the server.
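
For context, a minimal sketch of the two splitting strategies (assuming `input` is the feature-level data.table handled by summarization):

```r
library(data.table)

# split() on the table itself materializes one data.table per protein,
# effectively duplicating the entire dataset in memory at once.
input_split = split(input, input$PROTEIN)

# Splitting row indices instead only duplicates integers; each protein's
# rows are materialized one at a time inside the summarization loop.
protein_indices = split(seq_len(nrow(input)), list(input$PROTEIN))
single_protein = input[protein_indices[[1]], ]
```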

Changes

  • Because MSstatsSummarize is an already-exported function, I created a new function, MSstatsSummarizeWithSingleCore, that takes a different input format and processes the input dataset more memory-efficiently (a rough sketch follows this list).
    • I'm open to other suggestions for handling this backwards-compatibility issue with MSstatsSummarize.
  • Replaced MSstatsSummarize with MSstatsSummarizeWithSingleCore in dataProcess
  • Added a deprecation message to MSstatsSummarize in favor of MSstatsSummarizeWithSingleCore
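
A rough sketch of the single-core shape this introduces; names such as `summarizeSingleProtein` are illustrative placeholders, not the actual internal API:

```r
library(data.table)

summarize_with_single_core = function(input) {
    # Split only the row indices, not the data itself
    protein_indices = split(seq_len(nrow(input)), list(input$PROTEIN))
    results = vector("list", length(protein_indices))
    for (i in seq_along(protein_indices)) {
        # Materialize one protein's rows at a time
        single_protein = input[protein_indices[[i]], ]
        results[[i]] = summarizeSingleProtein(single_protein)  # hypothetical helper
    }
    results
}
```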

Testing

  • All existing unit tests pass
  • Available memory no longer dropped with the new implementation
  • Processing time for `protein_indices = split(seq_len(nrow(input)), list(input$PROTEIN))` was 1 minute 22 seconds on the FragPipe dataset.

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

@mstaniak (Contributor)

Thank you, I was aware of issues with split(), but I didn't know how severe they were.
Quick question before even looking at the code:
`protein_indices = split(seq_len(nrow(input)), list(input$PROTEIN))` tells me the code works with indices of the full data.table.
Would it make sense to instead simply iterate over the values of `unique(input$PROTEIN)`? I prefer to work with data [tables] rather than indices of any sort. Then we could pass [PROTEIN == current_protein], and this could be more consistent with the multi-core logic. Please correct me if I'm wrong about the last part.
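
For concreteness, the proposed iteration would look something like this (variable names illustrative):

```r
for (current_protein in unique(input$PROTEIN)) {
    single_protein = input[PROTEIN == current_protein]
    # ... summarize single_protein ...
}
```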

@tonywu1999 (Contributor, Author) commented Jul 16, 2024

> Thank you, I was aware of issues with split(), but I didn't know how severe they were. Quick question before even looking at the code: `protein_indices = split(seq_len(nrow(input)), list(input$PROTEIN))` tells me the code works with indices of the full data.table. Would it make sense to instead simply iterate over the values of `unique(input$PROTEIN)`? I prefer to work with data [tables] rather than indices of any sort. Then we could pass [PROTEIN == current_protein], and this could be more consistent with the multi-core logic. Please correct me if I'm wrong about the last part.

Yeah, working with `unique(input$PROTEIN)` looks better. I pushed an additional commit that removes all calls to split (it's a pretty time-intensive call) and starts from the list of unique proteins instead, for both the single- and multi-core functions.

I still keep track of the indices into the `unique(input$PROTEIN)` list only because, later in the functions, those indices are used to store results in the output vector (single core) and to track progress w.r.t. the number of proteins processed (multi-core); see the sketch below.
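
Roughly this shape (illustrative, with the same hypothetical per-protein helper as above):

```r
proteins = unique(input$PROTEIN)
results = vector("list", length(proteins))
for (protein_id in seq_along(proteins)) {
    single_protein = input[PROTEIN == proteins[protein_id]]
    # protein_id stores the result by position (single core) and doubles
    # as a progress counter (multi-core)
    results[[protein_id]] = summarizeSingleProtein(single_protein)  # hypothetical helper
}
```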

@tonywu1999 (Contributor, Author) commented Jul 16, 2024

@mstaniak

I just tried the approach in the latest commit of this PR and got the following results with the Nature Protocols dataset (8586 proteins):

Approach 1: Previous commit

  • `protein_indices = split(seq_len(nrow(input)), list(input$PROTEIN))`: 82 seconds
  • `single_protein = input[protein_indices[[i]],]`: 0.02 seconds per protein × 8586 proteins ≈ 171 seconds

Approach 2: Current commit

  • `proteins = unique(input$PROTEIN)`: 9 seconds
  • `single_protein = input[input$PROTEIN == proteins[[protein_id]],]`: 6.5 seconds per protein × 8586 proteins ≈ 55809 seconds

So it seems that Approach 1 is faster in runtime by a significant margin.
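
For reference, the per-step timings above can be reproduced along these lines (assuming the same `input` table; `<-` is required inside `system.time()`):

```r
# Approach 1 steps
system.time(protein_indices <- split(seq_len(nrow(input)), list(input$PROTEIN)))
system.time(single_protein <- input[protein_indices[[1]], ])

# Approach 2 steps
system.time(proteins <- unique(input$PROTEIN))
system.time(single_protein <- input[input$PROTEIN == proteins[[1]], ])
```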

@mstaniak (Contributor)

Thank you for the comparison.
Minor comment: in data.table syntax it's better to use `input[PROTEIN == proteins[protein_id]]` (it's also better to store proteins in a character vector), but I doubt it would have a significant impact on performance.
It looks like index-based filtering is simply the fastest. That's OK.

Does it make sense to mix the two approaches: keep a list of indices, but pass the data.table after filtering [subsetting] to a function that summarizes a single protein?

@mstaniak (Contributor)

Btw, this might be silly, but I did a quick check on another dataset and
`single_protein = input[PROTEIN == prot]`
also averaged 0.02 seconds per protein. Could you please check whether, on this large dataset, the difference between this syntax and your 6.5 s syntax is meaningful?
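
i.e. something along these lines:

```r
prot = unique(input$PROTEIN)[1]
system.time(single_protein <- input[PROTEIN == prot])
```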

@tonywu1999 (Contributor, Author) commented Jul 17, 2024

@mstaniak

I ended up trying your approach and it is definitely faster (I believe data.table builds a secondary index on PROTEIN for the `==` subset and then uses binary search instead of scanning the whole column).

Approach 1: Previous commit

  • `protein_indices = split(seq_len(nrow(input)), list(input$PROTEIN))`: 7 seconds
  • `single_protein = input[protein_indices[[i]],]`: 0.02 seconds per protein × 8586 proteins ≈ 171 seconds

Approach 2: Newest commit

  • `proteins = unique(input$PROTEIN)`: 4 seconds
  • `single_protein = input[PROTEIN == proteins[protein_id]]`: 0.08 seconds per protein × 8586 proteins ≈ 686 seconds

Honestly, I think the 0.02 s vs. 0.08 s per-protein difference doesn't matter much, especially since the bulk of the processing happens in the per-protein summarization function, and the code looks cleaner with the new approach.
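
If we ever want that fast path to be explicit rather than relying on data.table's auto-indexing, a `setkey`-based sketch (not part of this PR; assumes converting PROTEIN to character first):

```r
input[, PROTEIN := as.character(PROTEIN)]  # character column for a clean keyed join
setkey(input, PROTEIN)                     # sort once by PROTEIN
proteins = unique(input$PROTEIN)
single_protein = input[.(proteins[1])]     # binary-search subset on the key
```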

@tonywu1999 (Contributor, Author)

@mstaniak

> Thank you for the comparison.
> Minor comment: in data.table syntax it's better to use `input[PROTEIN == proteins[protein_id]]` (it's also better to store proteins in a character vector), but I doubt it would have a significant impact on performance.
> It looks like index-based filtering is simply the fastest. That's OK.

Is this comment still relevant? I ran `typeof(proteins)` and saw it was an integer vector; does a chr vector make any difference?

> Does it make sense to mix the two approaches: keep a list of indices, but pass the data.table after filtering [subsetting] to a function that summarizes a single protein?

Not sure if this comment is still relevant either, but I thought that's what the code is already doing? I might be misinterpreting here.
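
On the `typeof` point: `PROTEIN` is presumably a factor, so `typeof()` reports the underlying integer level codes; `class()` is the more informative check, and `as.character()` produces the chr vector mstaniak mentioned:

```r
proteins = unique(input$PROTEIN)
typeof(proteins)                 # "integer" (factors store integer level codes)
class(proteins)                  # "factor"
proteins_chr = as.character(proteins)
```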

@tonywu1999 (Contributor, Author) commented Jul 26, 2024

Reverting to the first commit's index-based approach for optimal performance. Confirmed with Mateusz that this is the way to go.

tonywu1999 merged commit 98a8267 into devel on Jul 26, 2024. 1 check passed.