Benchmarking of the id dda workflow (ms2rescore, percolator, SNR) #410
Currently, we have a workflow that can perform peptide identification using:
Each of these combinations can be turned off. We used the dataset PXD014415. The results can be found here (https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD014415-id-ms2rescore/).
Combinations & PSM counts:
Total number of PSMs by RAW file and combination
|
What is "non-sage"? Comet? |
Sorry, "non-sage" is COMET+MSGF |
Sage comes on top or as replacement? How expensive is the snr/feature calculation? I think we could improve the pyopenms script if it is expensive. Have you tried a more robust snr estimation? I.e. RMS of Top 10 / RMS of all? The max seems prone to outliers but seems to work well enough. Have you planned any false positive evaluation (like ground truth, entrapment, cross-species)? |
On top.
It is really fast, so there is no urgent need for improvements.
We can add more metrics like RMS of the Top 10. Feel free to do a PR to quantms utils https://github.com/bigbio/quantms-utils/blob/main/quantmsutils/features/snr.py
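To make the two SNR variants under discussion concrete, here is a minimal sketch in plain Python. It is not the code in `snr.py`; the function names `snr_max` and `snr_rms_top_n` are hypothetical, and it assumes the max-based estimate divides the most intense peak by the RMS of all peaks, while the proposed variant divides the RMS of the top 10 peaks by the RMS of all peaks.

```python
import math


def _rms(values):
    """Root mean square of a non-empty sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def snr_max(intensities):
    """Hypothetical max-based SNR: most intense peak / RMS of all peaks."""
    if not intensities:
        return 0.0
    noise = _rms(intensities)
    return max(intensities) / noise if noise > 0 else 0.0


def snr_rms_top_n(intensities, n=10):
    """Hypothetical robust SNR: RMS of the n most intense peaks / RMS of all peaks."""
    if not intensities:
        return 0.0
    noise = _rms(intensities)
    if noise == 0:
        return 0.0
    top = sorted(intensities, reverse=True)[:n]  # fewer than n peaks is fine
    return _rms(top) / noise


if __name__ == "__main__":
    # One extreme outlier peak on top of a flat baseline.
    spectrum = [1.0] * 99 + [1000.0]
    print("max-based:", snr_max(spectrum))
    print("RMS top 10:", snr_rms_top_n(spectrum))
```

With a single outlier peak, the top-10 variant is pulled up less than the max-based one, which is the robustness argument made above.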
I'm open to suggestions. I would love to evaluate whether this 5% increase in PSMs affects the FDR in some way. I'm also open to suggestions on how to evaluate the difference between SNR+MS2rescore and MS2rescore. I have manually checked some IDs (in proteogenomics - https://www.biorxiv.org/content/10.1101/2024.05.24.595489v1) and I know that ms2rescore can rescue (identify) some low-quality spectra, which is the reason why we added the SNR. It would be nice to have a benchmark to prove it. |
Today I was reading about MSAmanda + ms2rescore, and the increase in PSMs there is 6%. |
Any increase in #PSMs will depend on the type of dataset and search space. Generally, we see modest increases for simple searches (for instance the yeast UPS search above) and significant increases for difficult searches (46% for immunopeptidomics, 10.1016/j.mcpro.2022.100266). Note that even with modest increases in sensitivity, the separation between true and false PSMs is expected to be better, which means that in most cases you could increase the specificity to 0.1% FDR without losing sensitivity (for instance shown in doi:10.1016/j.mcpro.2021.100076). |
Thanks @RalfG for this response:
How do you test this? Distribution of the PEP scores or the original scores for targets and decoys? |
Usually just by plotting the amount of confidently identified PSMs at each FDR threshold, as in figure 1 of doi:10.1016/j.mcpro.2021.100076. |
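As a rough illustration of that kind of evaluation, the sketch below counts target PSMs accepted at several FDR thresholds using a simple target-decoy estimate (decoys / targets above each score cut-off, converted to q-values). This is an assumption for illustration only: it is not how Percolator or MS2Rescore compute q-values, and the function names are hypothetical. The resulting counts per threshold are what one would plot for each pipeline.

```python
def q_values(scores, is_decoy):
    """Simple target-decoy q-values; higher score means a better PSM.

    Returns q-values in the same order as the input lists.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fdrs = []
    targets = decoys = 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among everything scoring at least this well.
        fdrs.append(decoys / max(targets, 1))

    # q-value = lowest FDR reachable at this score or any lower cut-off.
    qvals = [0.0] * len(scores)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        qvals[order[rank]] = running_min
    return qvals


def count_targets_at_fdr(scores, is_decoy, thresholds=(0.001, 0.01, 0.05)):
    """Number of target PSMs with q-value <= each threshold."""
    q = q_values(scores, is_decoy)
    return {
        t: sum(1 for qi, d in zip(q, is_decoy) if not d and qi <= t)
        for t in thresholds
    }


if __name__ == "__main__":
    # Toy data: six PSMs, two of them decoys.
    scores = [10, 9, 8, 7, 6, 5]
    decoy = [False, False, False, True, False, True]
    print(count_targets_at_fdr(scores, decoy, thresholds=(0.01, 0.3)))
```

Running this once per pipeline (for example Percolator only, MS2Rescore, MS2Rescore+SNR) on the same spectra gives curves that can be compared at 1% and 0.1% FDR.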
I'm a bit curious about
Did you check the feature weights of percolator for this feature? I would guess that the Comet Xcorr implicitly penalizes for high SNR: https://willfondrie.com/2019/02/an-intuitive-look-at-the-xcorr-score-function-in-proteomics/ Would be great to see how high search engine scores / predicted features weights are in percolator! |
Thanks for your suggestions. Here are the latest benchmark results from PXD001819 and PXD014415. The top 20 percolator weights are shown in Figure 3 and Figure 4 (top panel is Comet, bottom panel is MSGF). The SNR features are plotted in Figure 5 and Figure 6: (a) is the percolator method, (b) is ms2rescore and (c) is ms2rescore+SNR. I think we can draw some conclusions:
Looking forward to your feedback! |
Nice job 👌🏼 Are any of the quantms features correlated with the MS2PIP or other features? Also, what exactly is "number of identified spectra"? (PSMs? Peptides?) |
Nice results! Cool to see that combining multiple search engines improves identification rates. How would you interpret this? Would this come from differences in candidate PSM generation? I find the radar charts a bit difficult to compare, as the features are ordered differently across each plot. A bit unfortunate to see only little differences when adding SNR features. In any case they do seem to help a bit? Would be nice to see the SNR feature distributions for accepted targets, rejected targets, and decoys. A bit similar to this: Or you could do something like this: |
PXD001819 Analysis
Currently, we have a workflow that can perform peptide identification using:
-> ms2rescore -> SNR + spectrum properties -> percolator
The results can be found here (https://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD001819-id-ms2rescore/).
Total number of PSMs
Comet only + Percolator: 495306
Comet + MSGF + Percolator: 572496 (15.58% increase)
Comet + MSGF + ms2rescore: 589200 (18.95% increase)
Comet + MSGF + (SNR + ms2rescore): 587972 (18.71% increase)
Comet + MSGF + SAGE + (SNR + ms2rescore): 592918 (19.68% increase)
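The percentages above are relative increases over the "Comet only + Percolator" baseline. A minimal sketch of that arithmetic, using only the counts listed above (the helper name `percent_increase` is hypothetical):

```python
# PSM counts copied from the list above.
psm_counts = {
    "Comet only + Percolator": 495306,
    "Comet + MSGF + Percolator": 572496,
    "Comet + MSGF + ms2rescore": 589200,
    "Comet + MSGF + (SNR + ms2rescore)": 587972,
    "Comet + MSGF + SAGE + (SNR + ms2rescore)": 592918,
}


def percent_increase(count, baseline):
    """Relative increase of count over baseline, in percent."""
    return 100.0 * (count - baseline) / baseline


baseline = psm_counts["Comet only + Percolator"]
increases = {
    name: round(percent_increase(count, baseline), 2)
    for name, count in psm_counts.items()
    if name != "Comet only + Percolator"
}

if __name__ == "__main__":
    for name, pct in increases.items():
        print(f"{name}: {pct}% increase")
```

Recomputed from the listed counts, the ms2rescore row comes out at about 18.96% and the SAGE row at about 19.71%, so the stated 18.95% and 19.68% may have been truncated or derived from slightly different counts.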
Total number of PSMs by RAW file and combination
The following questions would be interesting to understand: