Lower than expected number of proteins in spectral library #282

silasmellor · 2022-01-17T21:32:25Z

Hi, let me preface this by saying I am still fairly new to untargeted proteomics, so bear with me.

I have tried to use DIA-NN to analyze a dataset of timsTOF diaPASEF data. So far I have tried a few things. I started with the option to generate an in-silico spectral library from FASTA. This gives a fairly low number of proteins in the library (around 3300), whereas the FASTA file contains about 36000 protein sequences.

Next, I generated a spectral library from DDA runs (also timsTOF) of the same samples. This was done using FragPipe and produced a spectral library of about 10800 proteins. When I use this library in DIA-NN for the DIA files, I again see only a low number of proteins (2496 in the log below) after the program loads the FASTA file.

Am I missing something? The first part of the log is attached for reference.

Any help much appreciated,
Best,
Silas

diann.exe --f "D:\R0270\rawDIA\20211021_TIMS5_PRInLC1_PRI_P0096_R0270_120min_DIA_S1_A1_1_754.d" --f "D:\R0270\rawDIA\20211021_TIMS5_PRInLC1_PRI_P0096_R0270_120min_DIA_S2_A2_1_755.d" --f "D:\R0270\rawDIA\20211021_TIMS5_PRInLC1_PRI_P0096_R0270_120min_DIA_S3_A3_1_756.d" --f "D:\R0270\rawDIA\20211021_TIMS5_PRInLC1_PRI_P0096_R0270_120min_DIA_S4_A4_1_757.d" --f "D:\R0270\rawDIA\20211021_TIMS5_PRInLC1_PRI_P0096_R0270_120min_DIA_S5_A5_1_758.d" --f "D:\R0270\rawDIA\20211021_TIMS5_PRInLC1_PRI_P0096_R0270_120min_DIA_S6_A6_1_759.d" --f "D:\R0270\rawDIA\20211021_TIMS5_PRInLC1_PRI_P0096_R0270_120min_DIA_S7_A7_1_760.d" --f "D:\R0270\rawDIA\20211021_TIMS5_PRInLC1_PRI_P0096_R0270_120min_DIA_S8_A8_1_761.d" --f "D:\R0270\rawDIA\20211021_TIMS5_PRInLC1_PRI_P0096_R0270_120min_DIA_S9_A9_1_762.d" --f "D:\R0270\rawDIA\20211021_TIMS5_PRInLC1_PRI_P0096_R0270_120min_DIA_S10_A10_1_763.d" --f "D:\R0270\rawDIA\20211021_TIMS5_PRInLC1_PRI_P0096_R0270_120min_DIA_S11_A11_1_764.d" --f "D:\R0270\rawDIA\20211021_TIMS5_PRInLC1_PRI_P0096_R0270_120min_DIA_S12_A12_1_765.d" --f "D:\R0270\rawDIA\20211021_TIMS5_PRInLC1_PRI_P0096_R0270_120min_DIA_S13_B1_1_766.d" --f "D:\R0270\rawDIA\20211021_TIMS5_PRInLC1_PRI_P0096_R0270_120min_DIA_S14_B2_1_767.d" --f "D:\R0270\rawDIA\20211021_TIMS5_PRInLC1_PRI_P0096_R0270_120min_DIA_S15_B3_1_768.d" --lib "D:\Fragpipe\DDA-inflata_spec-lib\library.tsv" --threads 6 --verbose 1 --out "D:\R0270\DIA-NN\220117 Inflata test\report.tsv" --qvalue 0.01 --matrices --temp "D:\R0270\DIA-NN\220117 Inflata test" --reannotate --fasta "D:\R0270\Petunia fasta files\Petunia_inflata_v1.0.1_proteins.fasta" --met-excision --cut K*,R* --missed-cleavages 2 --min-pep-len 7 --max-pep-len 52 --min-pr-mz 400 --max-pr-mz 1201 --min-pr-charge 2 --max-pr-charge 4 --unimod4 --var-mods 5 --var-mod UniMod:35,15.994915,M --var-mod UniMod:1,42.010565,n --monitor-mod UniMod:1 --use-quant --double-search --no-prot-inf --reanalyse --smart-profiling --peak-center
DIA-NN 1.8 (Data-Independent Acquisition by Neural Networks)
Compiled on Jun 28 2021 14:55:31
Current date and time: Mon Jan 17 21:57:15 2022
CPU: GenuineIntel Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
SIMD instructions: AVX AVX2 FMA SSE4.1 SSE4.2
Logical CPU cores: 12
Thread number set to 6
Output will be filtered at 0.01 FDR
Precursor/protein x samples expression level matrices will be saved along with the main report
Library precursors will be reannotated using the FASTA database
N-terminal methionine excision enabled
In silico digest will involve cuts at K,R*
Maximum number of missed cleavages set to 2
Min peptide length set to 7
Max peptide length set to 52
Min precursor m/z set to 400
Max precursor m/z set to 1201
Min precursor charge set to 2
Max precursor charge set to 4
Cysteine carbamidomethylation enabled as a fixed modification
Maximum number of variable modifications set to 5
Modification UniMod:35 with mass delta 15.9949 at M will be considered as variable
Modification UniMod:1 with mass delta 42.0106 at *n will be considered as variable
Existing .quant files will be used
Neural networks will be used for peak selection
Protein inference will not be performed
A spectral library will be created from the DIA runs and used to reanalyse them; .quant files will only be saved to disk during the first step
When generating a spectral library, in silico predicted spectra will be retained if deemed more reliable than experimental ones
Fixed-width center of each elution peak will be used for quantification
DIA-NN will optimise the mass accuracy automatically using the first run in the experiment. This is useful primarily for quick initial analyses, when it is not yet known which mass accuracy setting works best for a particular acquisition scheme.
The following variable modifications will be scored: UniMod:1
WARNING: double-pass mode is incompatible with PTM scoring, turned off

15 files will be processed
[0:00] Loading spectral library D:\Fragpipe\DDA-inflata_spec-lib\library.tsv
[0:06] Finding proteotypic peptides (assuming that the list of UniProt ids provided for each peptide is complete)
[0:06] Spectral library loaded: 10803 protein isoforms, 10803 protein groups and 95435 precursors in 80400 elution groups.
[0:06] Loading FASTA D:\R0270\Petunia fasta files\Petunia_inflata_v1.0.1_proteins.fasta
[22:13] Reannotating library precursors with information from the FASTA database
[22:14] Finding proteotypic peptides (assuming that the list of UniProt ids provided for each peptide is complete)
[22:14] 95435 precursors generated
[22:14] Protein names missing for some isoforms
[22:14] Gene names missing for some isoforms
[22:14] Library contains 2496 proteins, and 2496 genes
[22:14] Initialising library
[22:15] Saving the library to D:\Fragpipe\DDA-inflata_spec-lib\library.tsv.speclib

[22:15] First pass: generating a spectral library from DIA data
[22:15] File #1/15
[22:15] Loading run D:\R0270\rawDIA\20211021_TIMS5_PRInLC1_PRI_P0096_R0270_120min_DIA_S1_A1_1_754.d
For most diaPASEF datasets it is better to manually fix both the MS1 and MS2 mass accuracies to 10 ppm.
[25:01] 91284 library precursors are potentially detectable
[25:01] Processing...
[25:28] RT window set to 2.43923
[25:28] Ion mobility window set to 0.04
[25:28] Peak width: 6.268
[25:28] Scan window radius set to 13
[25:28] Recommended MS1 mass accuracy setting: 12.246 ppm
[25:53] Optimised mass accuracy: 15.0884 ppm
[27:01] Removing low confidence identifications
[27:01] Searching PTM decoys
[27:01] Removing interfering precursors
[27:07] Training neural networks: 86868 targets, 88387 decoys
[27:17] Number of IDs at 0.01 FDR: 57527
[27:18] Calculating protein q-values
[27:18] Number of genes identified at 1% FDR: 2217 (precursor-level), 2181 (protein-level) (inference performed using proteotypic peptides only)
[27:18] Quantification
[27:20] Precursors with monitored PTMs at 1% FDR: 272 out of 306
[27:20] Unmodified precursors with monitored PTM sites at 1% FDR: 243 out of 273
[27:23] Quantification information saved to D:\R0270\DIA-NN\220117 Inflata test/D__R0270_rawDIA_20211021_TIMS5_PRInLC1_PRI_P0096_R0270_120min_DIA_S1_A1_1_754_d.quant.

vdemichev · 2022-01-18T10:40:03Z

vdemichev
Jan 18, 2022
Maintainer

Hi Silas,

Most likely the FASTA file is not being read correctly. I guess it's not in UniProt format?

Best,
Vadim

5 replies

silasmellor Jan 18, 2022
Author

Thanks Vadim, I suspected something was wrong with the FASTA file. The file is from solgenomics, since the proteins have not been deposited in UniProt yet, so you're right, the format is different.

Is there any problem with not supplying a FASTA file and simply letting the program do the IDs from the spectral library only? From a quick try, it looks like I get around 8.8-9k proteins this way.

If I were to generate a FASTA file in a UniProt-like format, do you know of a simple way to go about this? I would be curious to compare the in-silico spectral library to the experimental one.

Best,
Silas

vdemichev Jan 18, 2022
Maintainer

Isoform IDs (like UniProt identifiers) are probably fine anyway. So if you are interested in those, it's OK, the final report will contain what you need, just switch protein inference to 'Isoforms'. Also, indeed, if you already have a spectral library with proper protein groups, you can just turn protein group inference off in DIA-NN and you don't need to supply a FASTA file.

silasmellor Jan 18, 2022
Author

Well, at the moment that proteome is only perfunctorily annotated anyway, so simple IDs are fine, but I am unsure which part of the current headers might be causing issues. At the moment the headers look like this:

">Peinf101Scf00071g13014.1 Chalcone--flavonone isomerase A
MSPSVSVTEMHVENYVFAPTVNPAGSSNTLFLAGAGHRGLEIQGKFVKFTAIGVYLEESAIPFLAEKWKG
KTPEELTDSVEFFRDVVTGPFEKFTRVTMILPLTGKQYSEKVAENCVAHWKGIGTYTDDEGRAIEKFLDV
FRSETFPPGASIMFTQSPLGSLTISFAKDDSLTGTANAVIENKQLSEAVLESIIGKHGVSPAAKCSLAER
VAELLKKSYAEEASVFGKPETEKSTIPVIGV"

The spectral library was made from ddaPASEF runs on the same samples using FragPipe and the same FASTA file, so I can't quite figure out why the FASTA does not work properly in DIA-NN.
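One way to check how far the headers are from the UniProt layout is to count how many of them start with the `>db|ACCESSION|ENTRY_NAME` pattern that UniProt uses. The Python sketch below is only a rough diagnostic: the regular expression is an approximation of the UniProt convention, not the rule DIA-NN applies internally, and the second example header (`HYPO1`) is a made-up entry for illustration.

```python
import re
from collections import Counter

# Approximate UniProt-style header: >sp|ACCESSION|ENTRY_NAME description
# (assumption: this mirrors the UniProt convention, not DIA-NN's own parser)
UNIPROT_HEADER = re.compile(r"^>(sp|tr)\|[^|\s]+\|\S+")


def classify_headers(lines):
    """Count FASTA header lines that do or do not look UniProt-style."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            kind = "uniprot_like" if UNIPROT_HEADER.match(line) else "other"
            counts[kind] += 1
    return counts


example = [
    # Header copied from the post above
    ">Peinf101Scf00071g13014.1 Chalcone--flavonone isomerase A",
    "MSPSVSVTEMHVENYVFAPTVNPAGSSNTLFLAGAGHRGLEIQGKFVKFTAIGVYLEESAIPFLAEKWKG",
    # Hypothetical UniProt-style header, for comparison only
    ">tr|HYPO1|HYPO1_PETIN Hypothetical example entry OS=Petunia inflata",
    "MSPSV",
]

print(classify_headers(example))
```

To run it on the real file, pass an open file handle instead of the `example` list, e.g. `classify_headers(open(path))`.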

vdemichev Jan 18, 2022
Maintainer

Well, if you have a FragPipe library, just don't specify a FASTA and turn off protein inference in DIA-NN. This will solve the problem.
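As a small illustration of that suggestion, the Python sketch below takes an argument list like the one in the posted command and removes the FASTA-dependent parts: `--fasta` with its path, and `--reannotate`, which the log describes as reannotating the library "using the FASTA database". It only uses flags that appear in the command above; whether any other option should also change when no FASTA is supplied is not covered here, and the helper name `drop_fasta_options` is made up for this example.

```python
def drop_fasta_options(args):
    """Return a copy of a DIA-NN argument list without --fasta <path> and --reannotate."""
    cleaned = []
    skip_next = False
    for arg in args:
        if skip_next:
            # This is the path that followed --fasta; drop it too.
            skip_next = False
            continue
        if arg == "--fasta":
            skip_next = True
            continue
        if arg == "--reannotate":
            continue
        cleaned.append(arg)
    # Protein inference stays off (the posted command already had this flag).
    if "--no-prot-inf" not in cleaned:
        cleaned.append("--no-prot-inf")
    return cleaned


# Abbreviated version of the command from the first post
args = [
    "diann.exe",
    "--lib", r"D:\Fragpipe\DDA-inflata_spec-lib\library.tsv",
    "--threads", "6",
    "--reannotate",
    "--fasta", r"D:\R0270\Petunia fasta files\Petunia_inflata_v1.0.1_proteins.fasta",
    "--qvalue", "0.01",
    "--no-prot-inf",
]

print(drop_fasta_options(args))
```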

silasmellor Jan 18, 2022
Author

Thanks Vadim, I tried that and it works perfectly; I now get 8-9k IDs per sample. I am only asking because I am curious and would like to try the predicted library option as well. I will try to format the FASTA to look more like the UniProt format and see if that makes any difference.
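For reference, a minimal Python sketch of such a reformatting step could look like the following. It rewrites `>ID description` headers into the general UniProt shape `>tr|ID|ID_TAG description OS=organism GN=gene`. Several things here are assumptions made for illustration: the species tag `PETIN` is a placeholder, the gene name is guessed by stripping a trailing `.N` isoform suffix from the identifier, and it has not been verified in this thread that DIA-NN reads the result as intended.

```python
import re


def to_uniprot_like(header, species_tag="PETIN", organism="Petunia inflata"):
    """Rewrite '>ID description' as '>tr|ID|ID_TAG description OS=organism GN=gene'.

    species_tag is a placeholder, and the gene name is a guess
    (identifier without a trailing '.N' isoform suffix).
    """
    body = header[1:].strip()
    identifier, _, description = body.partition(" ")
    description = description.strip() or "Uncharacterized protein"
    gene = re.sub(r"\.\d+$", "", identifier)
    return (
        f">tr|{identifier}|{identifier}_{species_tag} "
        f"{description} OS={organism} GN={gene}"
    )


def convert_fasta(lines):
    """Yield FASTA lines with every header rewritten; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        yield to_uniprot_like(line) if line.startswith(">") else line


example = [
    ">Peinf101Scf00071g13014.1 Chalcone--flavonone isomerase A\n",
    "MSPSVSVTEMHVENYVFAPTVNPAGSSNTLFLAGAGHRGLEIQGKFVKFTAIGVYLEESAIPFLAEKWKG\n",
]

for out_line in convert_fasta(example):
    print(out_line)
```

Note that the accession field keeps the original identifier unchanged, so the IDs should still line up with those in a library built from the original FASTA.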
