-
Notifications
You must be signed in to change notification settings - Fork 4
/
MayInst_P6_day2_sec4.Rmd
343 lines (245 loc) · 13.5 KB
/
MayInst_P6_day2_sec4.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
---
title: "Statistics for quantitative mass spectrometry - Day 2"
subtitle: "Section 4 : MSstatsTMT, introductioin to data and preprocessing"
author: "Ting Huang"
date: "5/7/2018"
output:
html_document:
self_contained: true
toc: true
toc_float: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Day 2 - Section 4 : MSstatsTMT, introductioin to data and preprocessing
## Objective
- Preprocessing steps to make required input format for MSstatsTMT from output from diverse output of spectral processing and peptide quantification tools.
- Make annotation file, based on experimental design.
***
## Data
- Controlled mixtures: Sigma UPS1 48 protein-mix were spiked at 4 different ratio into a SILAC-labelled HeLa lysate. The mixtures were measured by TMT 10-plexes.
- the peptide quantification data of controlled mixtures, processed by Proteome Discoverer and MaxQuant.
![](img/Experimental_design.png)
***
## Load MSstatsTMT
Load MSstatsTMT first. Then you are ready to start MSstats.
```{r, eval=T, echo=T, warning=F}
# install MSstatsTMT from bioconductor
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("MSstatsTMT")
library(MSstatsTMT) # make sure it is version 1.2.1
?MSstatsTMT
```
![](img/MSstatsTMT-workfolow.jpg)
***
## Allowable data formats
`MSstatsTMT` performs statistical analysis steps, that follow peptide identification and quantitation. Therefore, input to MSstatsTMT is the output of other software tools (such as `Proteome Discoverer`, `MaxQuant` and so on) that read raw spectral files
, identify and quantify peptide ions. The preferred structure of data for use in MSstatsTMT is a .csv file in a *long* format with at least 10 columns representing the following variables: **ProteinName**, **PeptideSequence**, **Charge**, **PSM**, **Channel**, **Condition**, **BioReplicate**, **Mixture**, **TechRepMixture**, **Intensity**. The variable names are fixed, but are case-insensitive.
```{r, eval=T, echo=F, warning=F}
head(input.pd)
```
***
Let's start preprocessing steps to make required input format for MSstatsTMT from output from diverse output of peptide quantification tools.
![](img/MSstatsTMT_workflow_preprocessing.jpg)
## Proteome Discoverer output
### Read data
The required input data is the PSM-level data generated by `Proteome Discoverer 2.2`.
We first load and access the dataset processed by `Proteome Discoverer`. The file name is *'spikedin_PSMs.txt'*.
```{r}
# Read output from Proteome Discoverer
raw.pd <- read.delim(file="data/data_ProteomeDiscoverer_TMT/spikedin_PSMs.txt")
```
```{r}
# Check the column names
colnames(raw.pd)
```
The column names are differently from required input. Let's do preliminary check for this input.
```{r}
# total number of unique protein name
proteins <- unique(raw.pd$Protein.Accessions)
length(proteins)
# show the spiked-in proteins
proteins[grepl("ups",proteins)]
# total number of unique peptide names
length(unique(raw.pd$Annotated.Sequence))
# unique Spectrum.File, which is TMT run.
unique(raw.pd$Spectrum.File)
```
### Prepare annotation file
`MSstatsTMT` can make inference for group comparison design. In a group comparison design, the conditions (e.g., disease states) are profiled across **non-overlapping sets of biological replicates (i.e., subjects)**. In this example there are 4 conditions, 0.125, 0.5, 0.667, 1 (in general the number of conditions can vary). There are 2 subjects per condition per MS run (in general an equal number of replicates per condition is not required). Besides 2 pooled control replicates per run for across TMT-plex normalizations. Totally, each mixture has 10 replicates. There are 5 mixtures and each mixture has 3 technical replicate runs (in general technical replicates are not required, and their number per sample may vary). Overall, in this example there are 5 × 3 = 15 mass spectrometry runs and 5 × 3 x 10 = 150 replicates.
`Mixture` | `Run`
-----------|---------
1 | 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw
1 | 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw
1 | 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_03.raw
2 | 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw
2 | 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_02.raw
2 | 161117_SILAC_HeLa_UPS1_TMT10_Mixture2_03.raw
3 | 161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw
3 | 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_02.raw
3 | 161117_SILAC_HeLa_UPS1_TMT10_Mixture3_03.raw
4 | 161117_SILAC_HeLa_UPS1_TMT10_Mixture4_01.raw
4 | 161117_SILAC_HeLa_UPS1_TMT10_Mixture4_02.raw
4 | 161117_SILAC_HeLa_UPS1_TMT10_Mixture4_03.raw
5 | 161117_SILAC_HeLa_UPS1_TMT10_Mixture5_01.raw
5 | 161117_SILAC_HeLa_UPS1_TMT10_Mixture5_02.raw
5 | 161117_SILAC_HeLa_UPS1_TMT10_Mixture5_03.raw
Here show run **161117_SILAC_HeLa_UPS1_TMT10_Mixture1_01.raw** as an example,
`Run` | `Channel` | `BioReplicate` | `Condition`
-------|------------|----------------|-----------
1 | X127C | 1.X127C | 0.125
1 | X129N | 1.X129N | 0.125
1 | X128N | 1.X128N | 0.5
1 | X129C | 1.X129C | 0.5
1 | X127N | 1.X127N | 0.667
1 | X130C | 1.X130C | 0.667
1 | X128C | 1.X128C | 1
1 | X130N | 1.X130N | 1
1 | X126 | 1.X126 | Norm
1 | X131 | 1.X131 | Norm
The most important is that 1) `Run` is not order of spectral acquisition, but just unique MS run ID 2) if normalization between runs need to be done, then each run must have at least one `Norm` channel 3) `Channel` column in the annotation file should match with the corresponding channel names in in output of `Proteome Discoverer` 4) **Spectrum.File** in the PSM output of `Proteome Discoverer` should be the **Run** column in the annotation file 5) If one channel of one mixture doesn't have sample, put `None` under **Condition** and **BioReplicate** column .
Annotation information is required to fill in **Condition**, **TechRepMixture**, **Mixture** and **BioReplicate** for corresponding **Run** and **Channel** information. Users have to prepare as csv or txt file like 'PD_Annotation.csv', which includes **Run**, **Channel**, **Condition**, **BioReplicate**, **TechRepMixture** and **Mixture** information, and load it in R.
```{r}
annot.pd <- read.csv(file="data/data_ProteomeDiscoverer_TMT/PD_Annotation.csv")
head(annot.pd)
```
#### Prepare the file with run information
Raw Spectrum file name in the output of peptide quantification tool should be the Run information in the annotation file.
Let's start with **Spectrum.File** in PSM output of `Proteome Discoverer`.
```{r}
runs <- unique(raw.pd$Spectrum.File) # MS runs
runs
```
The run ID has the mixture and technical replicate information
Let's try to extract the mixture and technical replicate information.
```{r, message=FALSE, warning= FALSE}
library(tidyr)
library(dplyr)
Run_info <- data.frame(Run = runs) # initialize the run file
# use function separate()
?separate
## Add the mixture and technical replicate information
Run_info <- Run_info %>%
separate(Run, c("Mixture", "TechRepMixture"), sep="_0", remove = FALSE)
Run_info
```
Then let's clean the Mixture ID and TechRepMixture ID
```{r}
## clean the run file and add fraction information
Run_info <- Run_info %>%
mutate(Mixture = gsub("161117_SILAC_HeLa_UPS1_TMT10_", "", Mixture),
TechRepMixture = gsub(".raw", "", TechRepMixture),
Fraction = "F1")
Run_info
```
#### Prepare the file with group information
Let's add the group and biological replicate information for each channel in each mixture. The channel ID should be consistent with the reporter ion intensity columns in the PSM file of `Proteome Discoverer`.
```{r}
colnames(raw.pd)
## Channel ID in the PSM file
channels <- c("126", "127N", "127C", "128N", "128C", "129N", "129C", "130N", "130C", "131")
## Mixture ID from run file
mixtures <- unique(Run_info$Mixture)
mixtures
## create the file with channel information in each mixture
Group_info <- expand.grid(channels, mixtures)
colnames(Group_info) <- c("Channel", "Mixture")
head(Group_info)
## save the channel file and fill in the condition and biological replicate information manually
write.csv(Group_info, file = "Group_info.csv", row.names = FALSE)
## Now the condition information should be available in the file
Group_info_filled <- read.csv(file = "data/data_ProteomeDiscoverer_TMT/Group_info_pd.csv")
head(Group_info_filled)
```
#### Prepare the final annotation file
Let's generate the annotation file from run file and group file
```{r}
annotation <- full_join(Run_info, Group_info_filled)
nrow(annotation)
head(annotation)
```
### Preprocessing with `PDtoMSstatsTMTFormat`
The input data for `MSstatsTMT` is required to contain variables of **ProteinName**, **PeptideSequence**, **Charge**, **PSM**, **Channel**, **Condition**, **BioReplicate**, **TechRepMixture**, **Mixture**, **Intensity**. These variable names should be fixed. Output from Proteome Discoverer have different columns from the required input of `MSstatsTMT`. `PDtoMSstatsTMTFormat ` function helps pre-processing for making right format of MSstatsTMT input from Proteome Discoverer output. For example, it renames some column name, and remove shared peptides.
Here is the summary of pre-processing steps in `PDtoMSstatsTMTFormat` function.
+ Peptide ions which are shared by more than one protein are removed
+ If one spectrum has multiple identifications within one run, it only keeps the best identification which has most measurements within that run or highest identification score or largest summation of all the measurements within that run
+ If one peptide ion has any missing value within one run, it removes the peptide ion from that run
+ For fractionation, it removes the shared peptide ions among the fractions of each mixture and then combine all the fractions for each mixture
For further details, visit the help file using the following code.
```{r, eval=F}
?PDtoMSstatsTMTFormat
```
```{r, message=F, warning=F}
# reformating and pre-processing for PD output.
input.pd <- PDtoMSstatsTMTFormat(raw.pd, annotation=annotation)
head(input.pd)
```
### Preliminary check
```{r}
# total number of proteins
proteins <- unique(input.pd$ProteinName)
length(proteins)
# show the spiked-in proteins
proteins[grepl("ups",proteins)]
```
### Save your work
We can save the data that we made so far.
```{r}
save(input.pd, file='data/data_ProteomeDiscoverer_TMT/input.pd.rda')
#write.csv(input.pd, file='data/data_ProteomeDiscoverer_TMT/input.pd.csv', row.names = FALSE)
```
***
## MaxQuant output
### Read data
Three files should be prepared before MSstatsTMT. Two files, ‘proteinGroups.txt’ and ‘evidence.txt’ are outputs from MaxQuant.
```{r}
# First, get protein ID information
proteinGroups <- read.table("data/data_MaxQuant_TMT/proteinGroups.txt", sep = "\t", header = TRUE)
```
```{r}
# Read in MaxQuant file: evidence.txt
evi <- read.table("data/data_MaxQuant_TMT/evidence.txt", sep="\t", header=TRUE)
colnames(evi)
unique(evi$Raw.file)
```
Again, we need file annotation file, required to fill in Condition, BioReplicate and Mixture for corresponding Run and Channel information. Users have to prepare as csv or txt file like ‘MaxQuant_annotation.csv’, which includes **Run**, **Channel**, **Condition**, **BioReplicate**, **TechRepMixture** and **Mixture** information, and load it in R.
### Set annotation file
**Run** column in the annotation file should be the same as unique **Raw.file** in evidence.txt file. **Channel** column in the annotation file should match with the corresponding channel names in evidence.txt file.
```{r}
# Read in annotation including condition and biological replicates: MaxQuant_annotation.csv
annot.maxquant <- read.csv("data/data_MaxQuant_TMT/MaxQuant_annotation.csv", header = TRUE)
head(annot.maxquant)
```
### Preprocessing with `MaxQtoMSstatsTMTFormat`
`MaxQtoMSstatsTMTFormat` function helps pre-processing for making right format of MSstatsTMT input from MaxQuant output. Basically, this function gets peptide ion intensity from `‘evidence.txt’` file. In addition, there are several steps to filter out or to modify the data in order to get required information.
Here is the summary of pre-processing steps in `MaxQtoMSstatsTMTFormat` function
+ Use `Proteins` in ‘proteinGroups.txt’ as protein ID
+ Peptide ions which are shared by more than one protein are removed
+ If one spectrum has multiple identifications within one run, it only keeps the best identification which has most measurements within that run or highest identification score or largest summation of all the measurements within that run
+ If one peptide ion has any missing value within one run, it removes the peptide ion from that run
+ For fractionation, it removes the shared peptide ions among the fractions of each mixture and then combine all the fractions for each mixture
```{r, eval=F}
?MaxQtoMSstatsTMTFormat
```
```{r, message=F, warning=F}
# reformating and pre-processing for MaxQuant output.
input.maxquant <- MaxQtoMSstatsTMTFormat(evidence=evi,
annotation=annot.maxquant,
proteinGroups=proteinGroups)
head(input.maxquant)
```
### Preliminary check
```{r}
# total number of proteins
proteins <- unique(input.maxquant$ProteinName)
length(proteins)
# show the spiked-in proteins
proteins[grepl("ups",proteins)]
```
### Save your work
We can save the data that we made so far.
```{r}
save(input.maxquant, file='data/data_MaxQuant_TMT/input.maxquant.rda')
```