This is a repository for algorithms exploring untargeted and targeted detection, extraction, and characterisation of tryptic peptide features in raw MS1 data produced by the timsTOF mass spectrometer for LC-MS/MS proteomics experiments.
There are two approaches to peptide feature detection for the timsTOF in this repository.
- A DDA analysis pipeline. The first phase processes one or more runs at a time and detects peptide features using the instrument isolation windows as a starting point (targeted feature detection). It builds a library of peptides identified in at least one run. The second phase uses the peptide library to build machine learning models that predict the 3D coordinates of each peptide in the library. It then extracts the library peptides, along with decoys to control the false discovery rate (FDR) (targeted extraction). Code is here.
- 3DID, a de novo MS1 feature detector that uses the characteristic structure of peptides in 4D to detect and segment features for identification. Code is here.

The repository also includes Jupyter notebooks for generating the figures for the papers and some other visualisations.
The code has been tested with the runs deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD030706 (DOI 10.6019/PXD030706). Other timsTOF raw data should work as well.
The code has been tested on a PC with a 6-core (12-thread) Intel i7-6850K processor and 64 GB of memory running Ubuntu 20.04. It will run faster with more cores and more memory, because both allow it to increase parallelism. The pipeline automatically detects the hardware environment and uses resources according to the specified `proportion_of_cores_to_use` value.
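
As a rough illustration of how a proportion such as `proportion_of_cores_to_use` can translate into a number of worker processes, here is a minimal sketch. The function name `number_of_workers`, the rounding rule, and the minimum of one worker are assumptions made for illustration; the pipeline's own logic may differ.

```python
import os


def number_of_workers(proportion_of_cores_to_use, total_cores=None):
    """Return how many worker processes to start for a given proportion.

    Hypothetical helper: rounding down and keeping at least one worker
    are assumptions, not necessarily what the pipeline does.
    """
    if not 0.0 < proportion_of_cores_to_use <= 1.0:
        raise ValueError("proportion_of_cores_to_use must be in (0, 1]")
    if total_cores is None:
        # os.cpu_count() reports logical CPUs and may return None
        total_cores = os.cpu_count() or 1
    # round down, but always keep at least one worker
    return max(1, int(total_cores * proportion_of_cores_to_use))


if __name__ == "__main__":
    print(number_of_workers(0.8))
```

Under these assumptions, a machine reporting 12 logical CPUs with a proportion of 0.8 would run 9 workers, leaving some headroom for the operating system and I/O.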
- Follow the installation instructions here.
- Create a Python 3.8 environment with `conda create -n [name] python=3.8`.
- Activate the environment with `conda activate [name]`.
Follow the installation instructions here.
- Clone the repository with `git clone git@github.com:WEHI-Proteomics/tfde.git`.
- Install the required packages with `pip install -r ./tfde/requirements.txt`.
- Create a directory for the group of experiments. For example, `/media/big-ssd/experiments`. This is called the experiment base directory. All the intermediate artefacts and results produced by the pipeline will be stored in subdirectories created automatically under this directory.
- Under the experiment base directory, create a directory for each experiment. For example, `P3856_YHE010` for the human-only data.
- The pipeline expects the raw `.d` directories to be in a directory called `raw-databases` under the experiment directory. Either copy the `.d` directories there, or save storage by creating symlinks to them. For example, if the `.d` directories have been downloaded to `/media/timstof-output`, the symlinks can be created with `cd /media/big-ssd/experiments/P3856_YHE010/raw-databases` followed by `ln -s /media/timstof-output/* .`
- Edit the `./tfde/pipeline/bulk-run.sh` bash script to process the groups of technical replicates of the experiment. These are the runs that will be used to build the peptide library and from which the library peptides will be extracted. Be sure to specify the experiment base directory with the `-eb` flag, which has the value `/media/big-ssd/experiments` by default.
- Execute the pipeline with `./tfde/pipeline/bulk-run.sh`. Progress information is printed to stdout. Analysis will take a number of hours, depending on the complexity of the samples, the number of runs in the experiment, the length of the LC gradient, and the computing resources of the machine. It's convenient to use a command like this for long-running processes: `nohup ./tfde/pipeline/bulk-run.sh > tfde.log 2>&1 &`.
- The results are stored in a SQLite database called `results.sqlite` in the `summarised-results` directory. This database includes the peptides identified and extracted, the runs from which they were identified and extracted, and the proteins inferred. Examples of how to extract data from the results schema are in the `notebooks` directory.
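
Before turning to the notebooks, it can help to see which tables the results database contains. The sketch below uses only the Python standard library and assumes nothing about the schema; the helper name `list_tables` is hypothetical, and the example path assumes that `summarised-results` sits directly under the example experiment directory used above.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the names of the tables in a SQLite database, sorted by name."""
    # open read-only so the results cannot be modified by accident
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    con = sqlite3.connect(uri, uri=True)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # assumed location, based on the example directories in the steps above
    db = Path("/media/big-ssd/experiments/P3856_YHE010/summarised-results/results.sqlite")
    if db.exists():
        print(list_tables(db))
```

Once the table names are known, individual tables can be loaded in the usual way, for example with `pandas.read_sql_query`.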
- 3DID uses the TFD/E experiments directory structure, but has its own execute command. From step 4 of the TFD/E instructions onwards, use the following steps instead.
- Edit the `./tfde/3did/variable-minvi.sh` bash script to analyse a run from the experiment at different threshold depths, specified by the `minvi` parameter. Alternatively, you can use the `execute.py` script directly.
- Execute with `./tfde/3did/variable-minvi.sh`.
- The results are copied to a subdirectory under `/media/big-ssd` by default. Within this structure, the results are stored in a Feather file called `{features_dir}/exp-{experiment_name}-run-{run_name}-features-3did-dedup.feather`. Examples of how to extract data from this file are in the `notebooks` directory.
- The feature classification step requires a pre-trained model to be present. To train the model, use the `train the feature classifier` notebook. The notebook uses TensorFlow and CUDA, so your CUDA environment should be set up beforehand. These days it's much easier to set up CUDA using a Docker container.
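
To inspect the 3DID Feather results file described above outside the notebooks, something like the following sketch may be useful. The helper names `features_file` and `load_features` are hypothetical, the value of `features_dir` depends on your configuration, and loading requires `pandas` with `pyarrow` installed. No column names are assumed.

```python
from pathlib import Path


def features_file(features_dir, experiment_name, run_name):
    """Build the path of the de-duplicated 3DID features file."""
    name = f"exp-{experiment_name}-run-{run_name}-features-3did-dedup.feather"
    return Path(features_dir) / name


def load_features(features_dir, experiment_name, run_name):
    """Load the features into a pandas DataFrame (needs pandas and pyarrow)."""
    import pandas as pd  # imported here so path building works without pandas

    return pd.read_feather(features_file(features_dir, experiment_name, run_name))
```

For example, `load_features(features_dir, "P3856_YHE010", run_name).head()` shows the first few detected features, where `features_dir` and `run_name` are placeholders for your own values.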
If you find TFD/E or 3DID useful, please cite our papers.
- Wilding-McBride D, Dagley LF, Spall SK, Infusini G, Webb AI. Simplifying MS1 and MS2 spectra to achieve lower mass error, more dynamic range, and higher peptide identification confidence on the Bruker timsTOF Pro. PLOS ONE. 2022;17(7):e0271025. doi:10.1371/journal.pone.0271025
- Wilding-McBride D, Infusini G, Webb AI. Predicting coordinates of peptide features in raw timsTOF data with machine learning for targeted extraction reduces missing values in label-free DDA LC-MS/MS proteomics experiments. Published online May 2, 2022:2022.04.25.489464. doi:10.1101/2022.04.25.489464
- Wilding-McBride D, Webb AI. A de novo MS1 feature detector for the Bruker timsTOF Pro. PLOS ONE. 2022;17(11):e0277122. doi:10.1371/journal.pone.0277122