-
Notifications
You must be signed in to change notification settings - Fork 4
/
MayInst_P6_day1_sec1.Rmd
426 lines (272 loc) · 16.1 KB
/
MayInst_P6_day1_sec1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
# Day 1 - Section 1 : MSstats, introduction to data and preprocessing
## Objective
- Preprocessing steps to make required input format for MSstats from output from diverse output of spectral processing tools.
- Make annotation file, based on experimental design.
***
## Workflow in MSstats
![](img/MSstats_intro1.png)
![](img/MSstats_intro2.png)
![](img/MSstats_intro3.png)
***
## Data
- the quantified peak intensities data from ABRF 2015, processed by Skyline.
![Summary of the ABRF 2015 study design, Choi, et al. J Proteome Res. 2017](img/Protein_sample_information.png){width=70%}
***
## Load MSstats
Load MSstats first. Then you are ready to start MSstats.
```{r, eval=F, echo=T, warning=F}
library(MSstats)
?MSstats
```
```{r, eval=T, echo=F, warning=F}
#library(MSstats, warn.conflicts = F, quietly = T, verbose = F)
library(MSstats)
?MSstats
```
***
## Allowable data formats
`MSstats` performs statistical analysis steps, that follow peak identification and quantitation. Therefore, input
to MSstats is the output of other software tools (such as `Skyline`, `MaxQuant` and so on) that read raw spectral files
, identify and quantify spectral peaks. The preferred structure of data for use in MSstats is a .csv file
in a *long* format with at least 10 columns representing the following variables: **ProteinName**, **PeptideSequence**, **PrecursorCharge**, **FragmentIon**, **ProductCharge**, **IsotopeLabelType**, **Condition**, **BioReplicate**, **Run**, **Intensity**. The variable names are fixed, but are case-insensitive.
```{r, eval=T, echo=F, warning=F}
head(DDARawData)
```
***
## Convert to MSstats required format (Data cleaning)
***
Let's start preprocessing steps to make required input format for MSstats from output from diverse output of spectral processing tools.
![](img/MSstats_intro4.png)
- Data input support for various data acquisition methods : DDA, DIA, SRM
- Interoperability with existing computational tools :
- Converter functions for 7 data processin tools :
`SkylinetoMSstatsFormat`, `MaxQtoMSstatsFormat`, `OpenMStoMSstatsFormat`, `ProgenesistoMSstatsFormat`, `PDtoMSstatsFormat`, `SpectronauttoMSstatsFormat`, `OpenSWATHtoMSstatsFormat`, `DIAUmpiretoMSstatsFormat`
- Consistent data cleaning steps across converter functions : filter multiple measurements, shared peptides, etc.
- Generate the same format of data from diverse type of data format.
### Skyline output
#### Read data
The required input data is generated automatically when using `MSstats` report format in `Skyline`.
We first load and access the dataset processed by `Skyline`. The name of saved file from `Skyline` using `MSstats report format` is *'ABRF2015_Skyline_report.csv'*. or you can use the published data from this link (https://panoramaweb.org/labkey/project/MacCoss/brendan/manuscripts/iPRG%202015/begin.view). This example dataset is the exactly same data in 'iPRG_10ppm_2rt_15cut_nosingle.csv' from the link above.
```{r}
# Read output from skyline
raw.skyline <- read.csv(file="data/data_Skyline/ABRF2015_Skyline_report.csv")
```
```{r}
# Check the first 6 rows of dataset
head(raw.skyline)
```
There are some column named differently than required input. The information for `Condition` and `BioReplicate` is missing. Let's do preliminary check for this input.
```{r}
# total number of unique protein name
length(unique(raw.skyline$Protein))
# several isotopic peaks for peptide charge
unique(raw.skyline$FragmentIon)
# unique FileName, which is MS run.
unique(raw.skyline$FileName)
# 'Truncated' column
unique(raw.skyline$Truncated)
# count table for 'Truncated' column
xtabs(~Truncated, raw.skyline)
# count which 'Truncated' is 'True'
sum(raw.skyline$Truncated == 'True')
```
#### Set annotation file
Annotation information is required to fill in **Condition** and **BioReplicate** for corresponding **Run** information. Users have to prepare as csv or txt file like 'ABRF2015_Skyline_annotation.csv', which includes **Run**, **Condition**, and **BioReplicate** information, and load it in R.
```{r}
annot.skyline <- read.csv(file="data/data_Skyline/ABRF2015_Skyline_annotation.csv")
annot.skyline
```
##### Common mistake for annotation file : Incorrect Run information
Raw file name in the output of spectral processing tool is commonly long. We can make a typo in annotation file. If Raw file name and Run information are not matched,
Let's check whether **Run** information is same as **File.Name** in output of `Skyline`.
```{r}
setdiff(unique(raw.skyline$FileName), annot.skyline$Run)
setdiff(annot.skyline$Run, unique(raw.skyline$FileName))
```
There is no issue in this annotation.
Let's try the example with unmatched information.
```{r}
annot.wrong <- read.csv(file="data/data_Skyline/ABRF2015_Skyline_annotation_wrong_example.csv")
annot.wrong
```
**Note!** small details : hyphen vs underline.
##### Common mistake for annotation file : Incorrect `BioReplicate` information
`MSstats` distinguish the design of experiment as group comparison, time course, paired design, with the combination of `Condition`, `BioReplicate`, and `Run`.
###### Group comparison
![](img/expdesign_group.png){width=40%}
In a group comparison design, the conditions (e.g., disease states) are profiled across **non-overlapping sets of biological replicates (i.e., subjects)**. In this example there are 2 conditions, Disease and Control (in general the number of conditions can vary). There are 3 subjects (i.e., biological replicates) per condition (in general an equal number of replicates per condition is not required). Overall, in this example there are 2 × 3 = 6 mass spectrometry runs.
The most important is that 1) subject IDs for disease group are completely different thatn subject IDs for control group 2) `Run` is not order of spectral acquisition, but just unique MS run ID.
![](img/expdesign_group_tech.png)
In addition, if each subject has e technical replicate runs (in general technical replicates are not required, and their number per sample may vary). there are 2 × 3 × 3 = 18 mass spectrometry runs.
###### Time course
The important feature of a time course experimental design is that **a same subject (i.e., biological replicate) is repetitively measured across multiple time points**. In this example there are 3 time points, Time1, Time2, and Time3 (in general the number of times can vary). There are 3 subjects (i.e., biological replicates) measured across times (in general an equal number of times per replicate is not required). There are no technical replicates (in general the number of technical replicates per sample may vary). Overall, in this example there are 3 × 3 = 9 mass spectrometry runs.
![](img/expdesign_time.png){width=60%}
###### Paired design
Another frequently used experimental design is a *paired design*, where measurements from **multiple conditions (such as healthy biopsy and disease biopsy) are taken from a same subject**. The statistical model for this experimental design is the same as in the time course experiment, however the values in the columns of the input data may have a different appearence. In this example there are 3 subjects (in general the number of patients can vary). There are three conditions per subject (in general the number of conditions per subject can exceed two). Overall, in this example there are $2 \times 3 = 6$ mass spectrometry runs. `BioReplicate` should indicate each individual ID.
![](img/expdesign_pair.png){width=60%}
#### Preprocessing with `SkylinetoMSstatsFormat`
The input data for `MSstats` is required to contain variables of **ProteinName**, **PeptideSequence**, **PrecursorCharge**, **FragmentIon**, **ProductCharge**, **IsotopeLabelType**, **Condition**, **BioReplicate**, **Run**, **Intensity**. These variable names should be fixed. `MSstats` input from Skyline adapts the column scheme of the dataset so that it fits `MSstats` input format. However there are several extra column names and also some of them need to be changed. `SkylinetoMSstatsFormat` function helps pre-processing for making right format of MSstats input from Skyline output. For example, it renames some column name, and replace truncated peak intensities with NA. Another important step is to handle isotopic peaks before using `dataProcess`. The output from Skyline for DDA experiment has several measurements of peak area from the monoisotopic, M+1 and M+2 peaks. To get a robust measure of peptide intensity, we can sum over isotopic peaks per peptide or use the highest peak. Here we take a summation per peptide ion.
Here is the summary of pre-processing steps in `SkylinetoMSstatsFormat` function (in orange box below).
![](img/MSstats_Skyline.png)
For further details, visit the help file using the following code.
```{r, eval=F}
?SkylinetoMSstatsFormat
```
```{r}
# reformating and pre-processing for Skyline output.
input.skyline <- SkylinetoMSstatsFormat(raw.skyline,
annotation=annot.skyline,
removeProtein_with1Feature = TRUE)
head(input.skyline)
```
#### Preliminary check
```{r}
length(unique(input.skyline$ProteinName))
sum(is.na(input.skyline$Intensity))
sum(!is.na(input.skyline$Intensity) & input.skyline$Intensity==0)
```
***
### MaxQuant output
#### Read data
Three files should be prepared before MSstats. Two files, ‘proteinGroups.txt’ and ‘evidence.txt’ are outputs
from MaxQuant.
```{r}
# First, get protein ID information
proteinGroups <- read.table("data/data_MaxQuant/proteinGroups.txt", sep = "\t", header = TRUE)
```
```{r}
# Read in MaxQuant file: evidence.txt
evi <- read.table("data/data_MaxQuant/evidence.txt", sep="\t", header=TRUE)
colnames(evi)
unique(evi$Raw.file)
```
One file is for annotation information, required to fill in Condition and BioReplicate for corresponding Run information. Users have to prepare as csv or txt file like ‘ABRF2015_MaxQuant_annotation.csv’, which includes **Run**, **Condition**, and **BioReplicate** information, and load it in R.
#### Set annotation file
Annotation information is required to fill in **Condition** and **BioReplicate** for corresponding **Raw.file** information. Users have to prepare as csv or txt file like 'ABRF2015_MaxQuant_annotation.csv', which includes **Raw.file**, **Condition**, and **BioReplicate** information, and load it in R. **Raw.file** column in the annotation file should be the same as unique **Raw.file** in evidence.txt file.
```{r}
# Read in annotation including condition and biological replicates: annotation.csv
annot.maxquant <- read.csv("data/data_MaxQuant/ABRF2015_MaxQuant_annotation.csv", header = TRUE)
annot.maxquant
```
#### Preprocessing with `MaxQtoMSstatsFormat`
`MaxQtoMSstatsFormat` function helps pre-processing for making right format of MSstats input from MaxQuant output. Basically, this function gets peptide ion intensity from `‘evidence.txt’` file. In addition, there are several steps to filter out or to modify the data in order to get required information.
Here is the summary of pre-processing steps in `MaxQtoMSstatsFormat` function (in orange box below).
![](img/MSstats_MaxQ.png)
```{r, eval=F}
?MaxQtoMSstatsFormat
```
```{r}
# reformating and pre-processing for MaxQuant output.
# no protein with 1 peptide
input.maxquant <- MaxQtoMSstatsFormat(evidence=evi,
annotation=annot.maxquant,
proteinGroups=proteinGroups,
removeProtein_with1Peptide=TRUE)
head(input.maxquant)
```
#### Preliminary check
```{r}
length(unique(input.maxquant$ProteinName))
sum(is.na(input.maxquant$Intensity))
sum(!is.na(input.maxquant$Intensity) & input.maxquant$Intensity==0)
```
> Challenge
>
> * Let's check unique information for `Run`,`BioReplicate` and `Condition` in `input.maxquant`.
```{r}
unique(input.maxquant[, c('Run', 'BioReplicate', 'Condition')])
```
***
### OpenMS output
#### Read data
```{r}
# First, read output of OpenMS
raw.openMS <- read.csv("data/data_OpenMS/ABRF2015_OpenMS_raw.csv", stringsAsFactors=F) # the data file
head(raw.openMS)
```
#### Set annotation file
The output from `OpenMS` already includes `Run`, `BioReplicate`, `Condition` information. Let's check it.
```{r}
unique(raw.openMS[, c('Run', 'BioReplicate', 'Condition')])
```
#### Preprocessing with `OpenMStoMSstatsFormat`
There are several steps to filter out or to modify the data in order to get required information.
Here is the summary of pre-processing steps in `OpenMStoMSstatsFormat` function.
![](img/MSstats_OpenMS.png)
```{r, eval=F}
?OpenMStoMSstatsFormat
```
```{r}
# reformating and pre-processing for OpenMS output.
input.openms <- OpenMStoMSstatsFormat(raw.openMS,
removeProtein_with1Feature=TRUE)
## now 'input.openms' is ready for MSstats
head(input.openms)
```
#### Preliminary check for preprocessed data
```{r}
length(unique(input.openms$ProteinName))
sum(is.na(input.openms$Intensity))
sum(!is.na(input.openms$Intensity) & input.openms$Intensity==0)
table(input.openms$Run)
```
> Challenge
>
> * Let's check unique information for `Run`,`BioReplicate` and `Condition`.
```{r}
unique(input.openms[, c('Run', 'BioReplicate', 'Condition')])
```
***
### Progenesis output
#### Read data
```{r}
# First, read output of Progenesis
raw.progenesis <- read.csv("data/data_Progenesis/ABRF2015_Progenesis_raw.csv", stringsAsFactors=F) # the data file
head(raw.progenesis)
```
One file is for annotation information, required to fill in Condition and BioReplicate for corresponding Run information. Users have to prepare as csv or txt file like ‘ABRF2015_Progenesis_annotation.csv’, which includes **Run**, **Condition**, and **BioReplicate** information, and load it in R.
#### Set annotation file
Annotation information is required to fill in **Condition** and **BioReplicate** for corresponding **Run** information. Users have to prepare as csv or txt file like 'ABRF2015_Progenesis_annotation.csv', which includes **Run**, **Condition**, and **BioReplicate** information, and load it in R.
```{r}
## Read in annotation including condition and biological replicates: ABRF2015_Progenesis_annotation.csv
annot.progenesis <- read.csv("data/data_Progenesis/ABRF2015_Progenesis_annotation.csv", header = TRUE)
annot.progenesis
```
!! `Run` information should be the same as annotated in output of `Progenesis`.
#### Preprocessing with `ProgenesistoMSstatsFormat`
The output from `Progenesis` includes peptide ion-level quantification for each MS runs. `ProgenesistoMSstatsFormat` function helps pre-processing for making right format of `MSstats` input from `Progenesis` output. Basically, this function reformats wide format to long format. It provide **Raw.abundance**, **Normalized.abundance** and **Spectral count** columns. This converter uses **Raw.abundance** columns for Intensity values. In addition, there are several steps to filter out or to modify the data in order to get required information.
Here is the summary of pre-processing steps in `ProgenesistoMSstatsFormat` function (in orange box below).
![](img/MSstats_progenesis.png)
```{r, eval=F}
?ProgenesistoMSstatsFormat
```
```{r}
# reformating and pre-processing for Progenesis output.
input.progenesis <- ProgenesistoMSstatsFormat(raw.progenesis,
annotation=annot.progenesis,
removeProtein_with1Peptide=TRUE)
## now 'input.progenesis' is ready for MSstats
head(input.progenesis)
```
#### Preliminary check for preprocessed data
```{r}
length(unique(input.progenesis$ProteinName))
sum(is.na(input.progenesis$Intensity))
sum(!is.na(input.progenesis$Intensity) & input.progenesis$Intensity==0)
table(input.progenesis$Run)
```
> Challenge
>
> * Let's check unique information for `Run`,`BioReplicate` and `Condition`.
```{r}
unique(input.progenesis[, c('Run', 'BioReplicate', 'Condition')])
```
***
## Save your work
We can save the data that we made so far.
```{r}
save(input.skyline, file='data/data_Skyline/input.skyline.rda')
save(input.maxquant, file='data/data_MaxQuant/input.maxquant.rda')
save(input.openms, file='data/data_OpenMS/input.openms.rda')
save(input.progenesis, file='data/data_Progenesis/input.progenesis.rda')
```