
refactor(MSstatsSummarize): Create MSstatsSummarizeWithSingleCore to reduce memory and runtime usage #126

Merged 5 commits into devel on Jul 26, 2024

Conversation

@tonywu1999 (Contributor) commented Jul 15, 2024

Motivation and Context

Using the dataset from the FragPipe-MSstats Nature Protocols paper, I discovered that sections of the MSstats code are inefficient in both memory and runtime.

In one case, when running `input_split = split(input, input$PROTEIN)` with a single core here, I saw available memory drop by 140 GB (observed with `free -g` in the terminal), which crashed the server.
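
For context, a minimal sketch of the two splitting strategies (assuming `input` is the feature-level data.table handled by summarization):

```r
library(data.table)

# split() on the table itself materializes one data.table per protein,
# effectively duplicating the entire dataset in memory at once.
input_split = split(input, input$PROTEIN)

# Splitting row indices instead only duplicates integers; each protein's
# rows are materialized one at a time inside the summarization loop.
protein_indices = split(seq_len(nrow(input)), list(input$PROTEIN))
single_protein = input[protein_indices[[1]], ]
```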

Changes

  • Because MSstatsSummarize is an already-exported function, I created a new function, MSstatsSummarizeWithSingleCore, that takes a different input format and processes the input dataset more memory-efficiently (a rough sketch follows this list).
    • I'm open to other suggestions for handling this backwards-compatibility issue with MSstatsSummarize.
  • Replaced MSstatsSummarize with MSstatsSummarizeWithSingleCore in dataProcess
  • Added a deprecation message to MSstatsSummarize in favor of MSstatsSummarizeWithSingleCore
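
A rough sketch of the single-core shape this introduces; names such as `summarizeSingleProtein` are illustrative placeholders, not the actual internal API:

```r
library(data.table)

summarize_with_single_core = function(input) {
    # Split only the row indices, not the data itself
    protein_indices = split(seq_len(nrow(input)), list(input$PROTEIN))
    results = vector("list", length(protein_indices))
    for (i in seq_along(protein_indices)) {
        # Materialize one protein's rows at a time
        single_protein = input[protein_indices[[i]], ]
        results[[i]] = summarizeSingleProtein(single_protein)  # hypothetical helper
    }
    results
}
```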

Testing

  • All existing unit tests pass
  • Available memory no longer dropped with the new implementation
  • Processing time for `protein_indices = split(seq_len(nrow(input)), list(input$PROTEIN))` was 1 minute 22 seconds on the FragPipe dataset.

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

@mstaniak (Contributor)

Thank you, I was aware of issues with split(), but I didn't know how severe they were.
Quick question before even looking at the code:
`protein_indices = split(seq_len(nrow(input)), list(input$PROTEIN))` tells me the code works with indices of the full data.table.
Would it make sense to instead simply iterate over the values of `unique(input$PROTEIN)`? I prefer to work with data [tables] rather than indices of any sort. Then we could pass [PROTEIN == current_protein], and this could be more consistent with the multi-core logic. Please correct me if I'm wrong about the last part.
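
For concreteness, the proposed iteration would look something like this (variable names illustrative):

```r
for (current_protein in unique(input$PROTEIN)) {
    single_protein = input[PROTEIN == current_protein]
    # ... summarize single_protein ...
}
```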

@tonywu1999 (Contributor, Author) commented Jul 16, 2024

> Thank you, I was aware of issues with split(), but I didn't know how severe they were. Quick question before even looking at the code: `protein_indices = split(seq_len(nrow(input)), list(input$PROTEIN))` tells me the code works with indices of the full data.table. Would it make sense to instead simply iterate over the values of `unique(input$PROTEIN)`? I prefer to work with data [tables] rather than indices of any sort. Then we could pass [PROTEIN == current_protein], and this could be more consistent with the multi-core logic. Please correct me if I'm wrong about the last part.

Yeah, working with `unique(input$PROTEIN)` looks better. I pushed an additional commit that removes all calls to split (it's a pretty time-intensive call) and starts from the list of unique proteins instead, for both the single- and multi-core functions.

I still keep track of the indices into the `unique(input$PROTEIN)` list only because, later in the functions, those indices are used to store results in the output vector (single core) and to track progress w.r.t. the number of proteins processed (multi-core); see the sketch below.
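
Roughly this shape (illustrative, with the same hypothetical per-protein helper as above):

```r
proteins = unique(input$PROTEIN)
results = vector("list", length(proteins))
for (protein_id in seq_along(proteins)) {
    single_protein = input[PROTEIN == proteins[protein_id]]
    # protein_id stores the result by position (single core) and doubles
    # as a progress counter (multi-core)
    results[[protein_id]] = summarizeSingleProtein(single_protein)  # hypothetical helper
}
```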

@tonywu1999 (Contributor, Author) commented Jul 16, 2024

@mstaniak

I just tried the approach in the latest commit of this PR and got the following results with the Nature Protocols dataset (8586 proteins):

Approach 1: Previous commit

  • `protein_indices = split(seq_len(nrow(input)), list(input$PROTEIN))`: 82 seconds
  • `single_protein = input[protein_indices[[i]],]`: 0.02 seconds per protein × 8586 proteins ≈ 171 seconds

Approach 2: Current commit

  • `proteins = unique(input$PROTEIN)`: 9 seconds
  • `single_protein = input[input$PROTEIN == proteins[[protein_id]],]`: 6.5 seconds per protein × 8586 proteins ≈ 55809 seconds

So it seems that Approach 1 is faster in runtime by a significant margin.
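
For reference, the per-step timings above can be reproduced along these lines (assuming the same `input` table; `<-` is required inside `system.time()`):

```r
# Approach 1 steps
system.time(protein_indices <- split(seq_len(nrow(input)), list(input$PROTEIN)))
system.time(single_protein <- input[protein_indices[[1]], ])

# Approach 2 steps
system.time(proteins <- unique(input$PROTEIN))
system.time(single_protein <- input[input$PROTEIN == proteins[[1]], ])
```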

@mstaniak (Contributor)

Thank you for the comparison.
Minor comment: in data.table syntax it's better to use `input[PROTEIN == proteins[protein_id]]` (it's also better to store proteins in a character vector), but I doubt it would have a significant impact on performance.
It looks like index-based filtering is simply the fastest. That's OK.

Does it make sense to mix the two approaches: keep a list of indices, but pass the data.table after filtering [subsetting] to a function that summarizes a single protein?

@mstaniak (Contributor)

Btw, this might be silly, but I did a quick check on another dataset and
`single_protein = input[PROTEIN == prot]`
also averaged 0.02 seconds per protein. Could you please check whether, on this large dataset, the difference between this syntax and your 6.5 s syntax is meaningful?
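
i.e. something along these lines:

```r
prot = unique(input$PROTEIN)[1]
system.time(single_protein <- input[PROTEIN == prot])
```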

@tonywu1999 (Contributor, Author) commented Jul 17, 2024

@mstaniak

I ended up trying your approach and it is definitely faster (I believe data.table builds a secondary index on PROTEIN for the `==` subset and then uses binary search instead of scanning the whole column).

Approach 1: Previous commit

  • `protein_indices = split(seq_len(nrow(input)), list(input$PROTEIN))`: 7 seconds
  • `single_protein = input[protein_indices[[i]],]`: 0.02 seconds per protein × 8586 proteins ≈ 171 seconds

Approach 2: Newest commit

  • `proteins = unique(input$PROTEIN)`: 4 seconds
  • `single_protein = input[PROTEIN == proteins[protein_id]]`: 0.08 seconds per protein × 8586 proteins ≈ 686 seconds

Honestly, I think the 0.02 s vs. 0.08 s per-protein difference doesn't matter much, especially since the bulk of the processing happens in the per-protein summarization function, and the code looks cleaner with the new approach.
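
If we ever want that fast path to be explicit rather than relying on data.table's auto-indexing, a `setkey`-based sketch (not part of this PR; assumes converting PROTEIN to character first):

```r
input[, PROTEIN := as.character(PROTEIN)]  # character column for a clean keyed join
setkey(input, PROTEIN)                     # sort once by PROTEIN
proteins = unique(input$PROTEIN)
single_protein = input[.(proteins[1])]     # binary-search subset on the key
```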

@tonywu1999 (Contributor, Author)

@mstaniak

> Thank you for the comparison.
> Minor comment: in data.table syntax it's better to use `input[PROTEIN == proteins[protein_id]]` (it's also better to store proteins in a character vector), but I doubt it would have a significant impact on performance.
> It looks like index-based filtering is simply the fastest. That's OK.

Is this comment still relevant? I ran `typeof(proteins)` and saw it was an integer vector; does a chr vector make any difference?

> Does it make sense to mix the two approaches: keep a list of indices, but pass the data.table after filtering [subsetting] to a function that summarizes a single protein?

Not sure if this comment is still relevant either, but I thought that's what the code is already doing? I might be misinterpreting here.
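
On the `typeof` point: `PROTEIN` is presumably a factor, so `typeof()` reports the underlying integer level codes; `class()` is the more informative check, and `as.character()` produces the chr vector mstaniak mentioned:

```r
proteins = unique(input$PROTEIN)
typeof(proteins)                 # "integer" (factors store integer level codes)
class(proteins)                  # "factor"
proteins_chr = as.character(proteins)
```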

@tonywu1999 (Contributor, Author) commented Jul 26, 2024

Reverting to the first commit's index-based approach for optimal performance. Confirmed with Mateusz that this is the way to go.

tonywu1999 merged commit 98a8267 into devel on Jul 26, 2024. 1 check passed.