refactor(MSstatsSummarize): Create MSstatsSummarizeWithSingleCore to reduce memory and runtime usage #126
Conversation
Thank you, I was aware of issues with split(), but I didn't know how severe they were.
Yeah, working with it I still kept track of the indices of the proteins.
I just tried out the approach in the latest commit of this PR and got the following results with the Nature Protocols dataset (8586 proteins):
Approach 1: Previous commit
Approach 2: Current commit
So it seems that approach 1 is faster in runtime by a significant margin.
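For context, a minimal sketch of how the two splitting strategies can be timed with system.time; this is not the exact benchmark used above, and input is assumed to be the long-format data.table that dataProcess summarizes.

library(data.table)

# Approach based on row indices per protein (no copy of the measurement data).
t_index_split = system.time(split(seq_len(nrow(input)), input$PROTEIN))
# Approach that splits the full data.table per protein (copies the data).
t_data_split = system.time(split(input, input$PROTEIN))

print(rbind(index_split = t_index_split, data_split = t_data_split))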
Thank you for the comparison. Does it make sense to mix the two approaches: keep a list of indices, but pass the data.table, after filtering [subsetting], to a function that summarizes a single protein?
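A minimal sketch of the mixed approach suggested here, assuming input is a data.table with a PROTEIN column; summarizeSingleProtein() is a hypothetical placeholder for the single-protein summarization function, not an actual MSstats function.

library(data.table)

# Keep only a list of row indices per protein (cheap, no copy of the data),
# then subset the data.table just before summarizing each protein.
protein_indices = split(seq_len(nrow(input)), input$PROTEIN)

results = lapply(protein_indices, function(rows) {
  summarizeSingleProtein(input[rows])  # hypothetical per-protein summarizer
})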
Btw this might be silly but I did a quick check on another dataset and
I ended up trying your approach and it definitely is faster (I think it's because it treats PROTEIN as a key and data.table uses a hash-table-like mechanism in this case).
Approach 1: Previous commit
Approach 2: Newest commit
Honestly, I think a 0.02-second vs 0.08-second difference per protein doesn't matter too much, especially since the bulk of the processing will be in the per-protein summarization function, and the code looks cleaner with the new approach.
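A minimal sketch of the keyed variant described in this comment: setting PROTEIN as the key sorts the table once, so every per-protein subset becomes a fast keyed lookup instead of a full vector scan. summarizeSingleProtein() is again a hypothetical placeholder.

library(data.table)

setkey(input, PROTEIN)            # sort once by PROTEIN to enable keyed subsetting
proteins = unique(input$PROTEIN)

results = lapply(proteins, function(prot) {
  summarizeSingleProtein(input[.(prot)])  # keyed subset for a single protein
})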
Is this comment still relevant? I ran
Not sure if this comment is still relevant either, but I thought that's what the code is already doing? I might be misinterpreting here, though.
Will be reverting to the first commit for optimal performance. Confirmed with Mateusz that this is the way to go.
Motivation and Context
Using the dataset from the Fragpipe-MSstats Nature Protocols paper, I discovered that there are sections of the MSstats code that are inefficient in both memory usage and runtime.
In one case, when running
input_split = split(input, input$PROTEIN)
with a single core here, I saw the available memory decrease by 140 GB (using free -g in the terminal), causing the server to crash.

Changes
dataProcess
Testing
The runtime of

protein_indices = split(seq_len(nrow(input)), list(input$PROTEIN))

was 1 minute 22 seconds for the Fragpipe dataset.

Checklist Before Requesting a Review