Implementing an occam's razor in the Protein(s) field of the Default PSM Report #316

antortjim · 2018-06-11T07:43:22Z

Hi there!

I was wondering if it would be possible to simplify the protein groups in the PSM Report produced by PeptideShaker using some sort of Occam's razor approach. I opened an issue in the moFF repo (compomics/moFF#28) regarding this, but maybe it should be handled in PeptideShaker, since it is the program producing the protein groups.

The goal would be to have each protein id appear in only one protein group, similar to what the proteinGroups.txt file from MaxQuant provides. This would make interpretation of the results much easier.

Thanks beforehand!

Best regards
Antonio

hbarsnes · 2018-06-14T15:41:55Z

Hi Antonio,

This is not supported at the moment, but we'll keep it in mind for future releases.

Not sure if making the results interpretation easier is a good enough argument for (over-)simplifying the data though? But I guess that is a bigger discussion. ;)

Best regards,
Harald

antortjim · 2018-06-14T21:50:53Z

Hi Harald

It definitely is a bigger question 😇 but on the other hand, it does make sense to make protein ids unique to a single group.. it's a logical processing step to perform isn't it?

Thanks for looking at it

Best
Antonio

hbarsnes · 2018-06-15T10:04:46Z

Hi Antonio,

but on the other hand, it does make sense to make protein ids unique to a single group.. it's a logical processing step to perform isn't it?

Not sure I would call it logical given that protein inference issues are an inherent part of bottom-up proteomics. There is therefore no way of getting around the issue of shared peptides in a proper way. Well, besides kicking them out, which is what one often ends up doing when quantifying.

You can find more details, including examples, in Chapter 1.4 (from page 24 onwards) in our tutorial material: https://compomics.com/bioinformatics-for-proteomics.

Best regards,
Harald

antortjim · 2018-06-16T09:30:47Z

Hi Harald

Thanks for the suggestion! I already went through the tutorials and I actually think they are great 👍 .
What I meant by logical is that in the end one needs to be able to provide a single quantification value for each protein group. If a protein appears in multiple groups, then one cannot provide such a value. It's a big problem for sure.

For comparison with transcriptomics, this is I guess equivalent to the multimappers problem, i.e. a read that maps to > 1 place in the genome. I think the usual procedure is to discard them isn't it? So in the Endoplasmin illustration from page 34 in Chapter 1.4, this would mean that the 4 peptides that could also come from other proteins would be discarded in the quantification process. It's of course not a perfect solution, but it's a way to handle this.
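As a rough illustration of that discard step, here is a minimal Python sketch. The peptide sequences, protein ids and intensities are made up, and `quantify_unique_only` is a hypothetical helper, not something taken from PeptideShaker or moFF. It sums intensities per protein using only peptides that map to exactly one protein.

```python
from collections import defaultdict


def quantify_unique_only(peptides):
    """Sum intensities per protein, skipping peptides shared between proteins.

    `peptides` is a list of (sequence, proteins, intensity) tuples, where
    `proteins` is the set of protein ids the peptide could come from.
    """
    totals = defaultdict(float)
    discarded = []
    for sequence, proteins, intensity in peptides:
        if len(proteins) == 1:
            (protein,) = proteins  # unpack the single protein id
            totals[protein] += intensity
        else:
            # Shared peptide: ambiguous origin, so it is left out.
            discarded.append(sequence)
    return dict(totals), discarded


# Made-up example data (hypothetical peptides and intensities).
example = [
    ("PEPTIDEA", {"P1"}, 10.0),
    ("PEPTIDEB", {"P1"}, 5.0),
    ("PEPTIDEC", {"P1", "P2"}, 7.0),
    ("PEPTIDED", {"P2"}, 3.0),
]
totals, discarded = quantify_unique_only(example)
print(totals)
print(discarded)
```

The shared peptide contributes to neither protein, which is the information loss mentioned above.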

Please don't hesitate to correct me if I am wrong!

Thanks for your time and your input

Best regards,
Antonio

hbarsnes · 2018-06-16T21:45:21Z

Hi Antonio,

What I meant by logical is that in the end one needs to be able to provide a single quantification value for each protein group.

I think you're mixing "needs" with "wants". ;)

Of course one would prefer to be able to simplify the protein groups down to a single protein per group, but in many cases this is not possible. The simplest example is if you have a set of peptides that can map to two distinct proteins (with no additional unique peptides for either protein). Given the information from a bottom-up proteomics experiment there is then simply no way in which one can confidently pick one protein over the other as the representative of that group. And given that the proteins can be very different proteins (in terms of biological context), making the wrong call could greatly mess up the downstream analysis.
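That simplest case can be sketched in a few lines of Python. The peptide-to-protein mapping below is invented and `group_indistinguishable` is a hypothetical helper. It only shows that proteins supported by exactly the same peptides cannot be separated by the data, and it is not how PeptideShaker builds its groups.

```python
from collections import defaultdict


def group_indistinguishable(peptide_to_proteins):
    """Group proteins whose sets of identified peptides are identical."""
    # Invert the mapping: which peptides support each protein?
    protein_to_peptides = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides[protein].add(peptide)

    # Proteins with the same peptide evidence land in the same bucket.
    by_evidence = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        by_evidence[frozenset(peptides)].append(protein)

    return sorted(sorted(group) for group in by_evidence.values())


# Two peptides, both shared by the (made-up) proteins PA and PB.
shared_only = {
    "PEPTIDEA": {"PA", "PB"},
    "PEPTIDEB": {"PA", "PB"},
}
groups = group_indistinguishable(shared_only)
print(groups)
```

Both proteins end up in one group, and nothing in the input favours either of them as the representative.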

I think the usual procedure is to discard them isn't it? So in the Endoplasmin illustration from page 34 in Chapter 1.4, this would mean that the 4 peptides that could also come from other proteins would be discarded in the quantification process. It's of course not a perfect solution, but it's a way to handle this

Yes, this is what one often ends up doing, especially as part of the quantification. But as you say, it's not perfect. But at least one is not making decisions without any data to back it up.

I guess what I'm trying to say is that there is no way around protein inference issues in bottom-up proteomics, and in PeptideShaker we prefer to show the complexity of the protein grouping and let the users decide how to deal with it rather than to try to simplify the groups via a relatively random selection of group representatives.

That is not to say that we don't do our best to clean up the protein groups and select the best protein group representative possible, e.g. the protein with the highest evidence level in UniProt. This additional information (also indicated in the color coding of the protein groups) could potentially be used when deciding which groups to remove and which to keep. For example, if a given protein group consists of two closely-related isoforms, then perhaps keeping the group (and quantifying the two proteins together) is not a big problem. Unless one particularly wants to quantify the difference between the two isoforms of course.

Best regards,
Harald

antortjim · 2018-06-22T14:59:10Z

Hi Harald

Sorry for taking a few days to come back to you and thank you so much for your detailed answer 😄

Yeah, I definitely mistook need for want! I understand that keeping protein groups with multiple proteins does make complete sense, as many times one just cannot differentiate between them, particularly when they exhibit relatedness according to the annotation. A clear case of that is, as you mentioned, isoforms. The only way to differentiate between them would be to catch peptides exhibiting isoform-specific sequences, and still that would not solve what to do with all those that map to both. The latter need to be mapped to a protein group containing both isoforms, while the former could be mapped to a protein group containing the single isoform.

The big question then would be how to deal with the isoform present in both (1) its own protein group and (2) the protein group shared with the other isoforms. My intuition tells me that one valid approach would be combining all their peptides into a single protein group containing all the isoforms.

It is an even harder problem to solve when not all proteins available in the smaller groups are contained within the biggest protein group, like in the situation below:

P1
P1;P4
P2
P3
P4
P1;P2;P3

The individual groups P1, P2, P3, could be combined with P1;P2;P3, so all the peptides are assigned to a single group. But what to do with P1;P4? I guess it depends on the P1-P4 relatedness.
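One way to see what "combining into a single group" does to this example is to merge every pair of groups that share a protein id. The Python sketch below does that with a small union-find. `merge_overlapping_groups` is a hypothetical helper, and the `;` separator is taken from the listing above.

```python
def merge_overlapping_groups(groups, sep=";"):
    """Merge protein groups that share at least one protein id."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every protein in a group to the first protein of that group.
    for group in groups:
        proteins = [p for p in group.split(sep) if p]
        for protein in proteins:
            union(proteins[0], protein)

    # Collect the proteins of each connected component.
    merged = {}
    for protein in list(parent):
        merged.setdefault(find(protein), set()).add(protein)
    return sorted(sep.join(sorted(members)) for members in merged.values())


example = ["P1", "P1;P4", "P2", "P3", "P4", "P1;P2;P3"]
merged = merge_overlapping_groups(example)
print(merged)
```

Because P1;P4 acts as a bridge, P4 ends up in the same merged group as P2 and P3, whether or not it is related to them. That is exactly why the P1-P4 relatedness matters.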

For now I am applying the logic followed by the MSqRob developers, implemented in their smallestUniqueGroups.R function, which drops protein groups containing proteins which are simultaneously contained in smaller protein groups. Far from perfect, but a simple way of getting each protein id to occur in a single protein group.
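For readers who do not use R, the rule as described above can be sketched in Python. This is one reading of the description in this comment, not a port of MSqRob's smallestUniqueGroups.R. The function name `smallest_unique_groups` and the `;` separator are chosen here for illustration.

```python
def smallest_unique_groups(groups, sep=";"):
    """Keep only groups with no protein already present in a smaller kept group."""
    # Parse each group string into a set of protein ids, ignoring empty tokens.
    parsed = {frozenset(p for p in g.split(sep) if p) for g in groups}
    parsed.discard(frozenset())

    kept = []
    seen_in_smaller = set()
    # Walk through the group sizes from smallest to largest.
    for size in sorted({len(g) for g in parsed}):
        current = [
            g for g in parsed
            if len(g) == size and not (g & seen_in_smaller)
        ]
        kept.extend(current)
        # Proteins of the groups just kept block any larger group containing them.
        for g in current:
            seen_in_smaller |= g
    return sorted(sep.join(sorted(g)) for g in kept)


example = ["P1", "P1;P4", "P2", "P3", "P4", "P1;P2;P3"]
kept = smallest_unique_groups(example)
print(kept)
```

On the example above only the four single-protein groups survive, so the peptides behind P1;P4 and P1;P2;P3 would no longer be used.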

Please correct me if I am wrong, but I think PeptideShaker could offer some Occam's razor algorithms implementing different solutions to the problems above, for fine-tuning the protein inference behaviour of the program.

For those interested: I am using MSqRob by passing it the output of moFF, which in turn is just a processed form of the PeptideShaker DefaultPSMReport.txt upon which Match Between Runs and Apex Intensity Extraction have been performed. Their Occam's razor implementation is executed when running the preprocess_MSnSet() function with the flag smallestUniqueGroups set to TRUE.

Thanks once again for your contributions

Best regards

Antonio

hbarsnes · 2018-06-25T21:41:24Z

Hi Antonio,

drops protein groups containing proteins which are simultaneously contained in smaller protein groups. Far from perfect, but a simple way of getting each protein id to occur in a single protein group.

All the protein groups in PeptideShaker contain at least one peptide that is unique to the protein(s) in that group. I don't see how any groups can be automatically dropped without also losing potentially vital information?

But you are of course free to filter the output in the post-processing. For example, by only keeping protein groups consisting of single proteins, or perhaps also including groups consisting of (assumed) related proteins.

Best regards,
Harald

antortjim · 2018-07-03T14:16:20Z

Hi Harald

Thank you for your answer! I agree that of course if one drops groups, vital information can be lost. But I would love to have a utility implementing different heuristics that potentially fit different purposes, so that by running a single command one could set at will how the protein inference problem is dealt with. For now I am using the smallestUniqueGroups function I referred to above, but maybe more powerful methods could be made available.

Thanks for your great dedication.

Best regards

Antonio

hbarsnes self-assigned this Jun 14, 2018

hbarsnes added the enhancement label Jun 14, 2018
