SAGE failing in big database #327
Comments
The dataset is PXD010154 and the database is http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/proteogenomics/noncanonical-tissues-2023/GRCh38r110_GCA97s_noncanonical_proteins_19Jul23-decoy.fa
Ping here, @lazear. This is the same dataset we discussed before, but with a larger database, and all my jobs are failing. I'm splitting with quantms into 100 jobs with 10 files per job, and even with 700 GB of memory and 48 CPUs it fails.
If the database is too big, using more jobs will be worse, since Sage has to build the index separately in each job. Sage is designed for extremely efficient single-node, single-job tasks, so splitting into smaller jobs will almost always be worse than bigger CPU jobs with more files. The database needs to be reduced in size, jobs run over the database in chunks, and the results combined at the end.
@lazear thanks for the quick reply. It will take me some time to implement a chunking solution in quantms. In the meantime, do you have an estimate (even if it is far from exact) of how much memory a node needs for a 2 GB FASTA database?
I don't have an estimate, but it's a very bad idea. Fragment indexing is largely memory-bandwidth bound: larger databases cause dramatic slowdowns, and at some point the algorithm stops being able to saturate the CPUs (around 10x larger than canonical human UniProt with trypsin + 2 missed cleavages). Database splitting is absolutely necessary for good performance with a database of that size.
@jpfeuffer @daichengxin @timosachsenberg How can we implement in quantms the logic to partition the database for searches against very large databases? I understand from @lazear here that SAGE will not have a way around large databases. Instead, @lazear is suggesting that we partition the database and search all the raw files against each chunk; in this case quantms would not be partitioning/distributing raw files but data (FASTA files). I think this is doable, but we need the following:
What do you think?
It is hard to predict from the outside which/how many sequences to put into the individual chunks. I think in Sage one could (in principle) stop generating the fragment index, perform the search, and then later merge results from the individual chunks. This might lead to some peptides being searched multiple times, but well... I think MSFragger does something similar.
Ah, I think I misunderstood the issue. I thought it was just about fitting the fragment index into memory, but this is also about distributing it across several nodes?
What's needed is just a way to distribute different FASTA files across nodes (analogous to distributing different raw files). You could divide your FASTA into 100 parts, spin up 100 jobs to process each chunk separately (thus dramatically reducing fragment index memory use), and then combine the results. Since you already have a pipeline in place for combining results from multiple searches, rescoring, and doing protein inference, it should work pretty well - you just need to take the best hit for each spectrum, since there will be 100 best hits (one from each search). I will also try to work on database splitting, since once semi-enzymatic search is officially released I imagine I will get many more requests for it.
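A minimal sketch of the FASTA-splitting step described above, assuming plain FASTA input and a round-robin assignment of proteins to chunks. The chunk count and output file names are illustrative, not anything quantms or Sage prescribes:

```python
# Split a large FASTA database into N smaller chunks, one per search job.
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

def split_fasta(path, n_chunks, prefix="db_chunk"):
    """Round-robin protein entries into n_chunks smaller FASTA files."""
    outs = [open(f"{prefix}_{i:03d}.fasta", "w") for i in range(n_chunks)]
    try:
        for i, (header, seq) in enumerate(read_fasta(path)):
            outs[i % n_chunks].write(f"{header}\n{seq}\n")
    finally:
        for out in outs:
            out.close()

split_fasta("GRCh38r110_GCA97s_noncanonical_proteins_19Jul23-decoy.fa", 100)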
Actually, @lazear, we were internally discussing the issue and trying to find a solution, and @timosachsenberg pointed out that another case is semi-enzymatic analysis, which also increases the size of the search space dramatically. We think (and were planning to discuss with you) that it would be better to have a solution within SAGE that handles both cases, big databases and semi-enzymatic searches. The problem with splitting databases is the target-decoy approach: we would need to control the distribution of targets and decoys and also make sure the other search engines supported by the tool handle the chunks well. We are happy to discuss possible solutions here with you because, as you said, we are already parallelizing the searches by MS run.
Why would the distribution of targets and decoys matter? If you are performing external competition on the combined search results (e.g. taking the highest-scoring PSM for each spectrum), it shouldn't matter if you did one search with 100% targets and another with 100% decoys - the end result should be exactly the same. Sage, Comet, etc. will also happily generate decoys for you.
I think that if the database is split into several parts, we need to make sure that decoys for the same peptide in different chunks are generated deterministically. Otherwise, we might end up with more decoys per peptide.
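A sketch of the "external competition" merge described above: keep only the highest-scoring PSM per spectrum across all chunk searches. The directory layout and the column names (`filename`, `scannr`, `hyperscore`) are assumptions for illustration, not a confirmed Sage output schema:

```python
import csv
import glob

best = {}  # (filename, scannr) -> best PSM row seen so far
for tsv in glob.glob("chunk_*/results.sage.tsv"):  # hypothetical per-chunk layout
    with open(tsv, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            key = (row["filename"], row["scannr"])
            if key not in best or float(row["hyperscore"]) > float(best[key]["hyperscore"]):
                best[key] = row

rows = list(best.values())
if rows:
    with open("combined_psms.tsv", "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=rows[0].keys(), delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
```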
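One way to get the deterministic behaviour mentioned here is a purely sequence-based decoy scheme, e.g. the common reverse-between-termini approach: the same target peptide always maps to the same decoy, no matter which chunk it appears in. This is a sketch of the general idea, not code taken from Sage:

```python
def reverse_decoy(peptide: str) -> str:
    """Reverse all residues except the C-terminal one (keeps the K/R cleavage site),
    so identical peptides in different chunks yield identical decoys."""
    if len(peptide) <= 2:
        return peptide
    return peptide[-2::-1] + peptide[-1]

assert reverse_decoy("LESLIEK") == "EILSELK"
assert reverse_decoy("LESLIEK") == reverse_decoy("LESLIEK")  # deterministic by construction
```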
OK, so after doing some hacking I can offer at least a partial solution here. You'll need enough memory to fit all of the […].

I have modified the codebase so that more than 2^32 peptides can be used, and to defer fragment ion generation. This means that mzMLs can be sorted by precursor mass and processed in chunks: a new fragment ion index is generated for each chunk, containing only peptides within the monoisotopic mass range of those MS2 precursors. This essentially defers generating the actual theoretical fragments until they are needed, and should eliminate a substantial portion of memory use, but not all of it - the […].

It might be possible to modify Sage further to parse FASTA files in chunks on-the-fly, but that will require more work than I am willing to do right now (at the moment the program basically requires that the full set of […]).
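A conceptual sketch of the precursor-mass chunking idea described in this comment (not Sage's actual implementation): sort spectra by precursor mass, cut them into batches, and for each batch consider only the peptides whose mass falls inside that batch's window. Function names, batch sizes, and the tolerance are illustrative assumptions:

```python
from bisect import bisect_left, bisect_right

def mass_chunks(precursor_masses, peptides, n_chunks, tol_da=5.0):
    """Yield (spectrum_batch, candidate_peptides) pairs.

    precursor_masses: list of floats, one per MS2 spectrum
    peptides: list of (mass, sequence) tuples, sorted ascending by mass
    """
    order = sorted(range(len(precursor_masses)), key=lambda i: precursor_masses[i])
    batch_size = max(1, len(order) // n_chunks)
    pep_masses = [m for m, _ in peptides]
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        lo = precursor_masses[batch[0]] - tol_da
        hi = precursor_masses[batch[-1]] + tol_da
        # Only these peptides need theoretical fragments generated for this batch.
        candidates = peptides[bisect_left(pep_masses, lo):bisect_right(pep_masses, hi)]
        yield batch, candidates
```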