proteolizard-algorithm

A collection of algorithms and tooling to process ion-mobility mass-spectrometry raw-data

This repository is part of the proteolizard project, a free and open-source solution for raw-data access, algorithms and raw-data visualization of mass spectrometry data generated with the Bruker timsTOF device.

We are a relatively small team of developers and therefore decided to keep things loosely coupled. This means that

data access: proteolizard-data
algorithms: proteolizard-algorithm
visualization: proteolizard-vis

are made available in separate repositories. This makes it easier for us to develop all pieces independently. We try to keep dependencies as small as possible, which should allow you to swap parts of your custom pipelines for other data-access backends such as timspy or alphatims.

Development is still ongoing. If you experience weird behaviour, bugs or errors please let us know!

Why proteolizard-algorithm?

proteolizard-algorithm provides algorithms and tools tailored to the large volume of raw data generated by liquid chromatography coupled to ion-mobility tandem mass spectrometry (LC-IMS-MS-MS). Recording ion mobility adds another dimension to experiments, and data sparsity increases with it. As a result, many traditional approaches to LC-MS-MS processing are either too slow or not designed for these datasets.

Our goal is to bring ideas from other data-science disciplines that deal with related problems to this setting. We especially want to make use of modern hardware such as multicore systems and GPU parallelization.

Navigation

Build and install proteolizard-algorithm
Locality Sensitive Hashing (LSH)
Clustering
Supervised (Deep) Learning

Build and install proteolizard-algorithm

We highly recommend installing all libraries that are part of the proteolizard project into a Python virtual environment or conda environment.

To use proteolizard-algorithm, you will need to install proteolizard-data first. After that, build the C++ shared library for Python:

shell> git clone https://github.com/theGreatHerrLebert/proteolizard-algorithm
shell> cd proteolizard-algorithm

shell> mkdir build && cd build
shell> cmake ../cpp -DCMAKE_BUILD_TYPE=Release
shell> make

Or, if you did not install proteolizard-data into a global install directory, you also need to set CMAKE_PREFIX_PATH to the same installation prefix used for proteolizard-data:

shell> mkdir build && cd build
shell> cmake ../cpp -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH=path/to/proteolizard-data/install
shell> make

In both cases, install the library afterwards:

shell> cmake --install . --prefix=some/prefix/path

Locality Sensitive Hashing (LSH)

LSH is a stochastic technique to find similar objects, where similarity is estimated using a family of hash functions that are tailored to approximate some similarity measure. Its main advantage over other algorithms is that similar pairs can be found in linear time: LSH gives up the guarantee of finding all similar objects in exchange for a high probability of detecting them.
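To make this trade-off concrete, here is a minimal sketch for one well-known LSH family, random hyperplanes for cosine similarity. It uses the textbook collision formula and is not taken from the proteolizard code base. The function name `collision_probability` is hypothetical, and mapping `bits_per_key` and `num_keys` to the `len_trial` and `trials` parameters used further down is an assumption.

```python
import math


def collision_probability(cosine: float, bits_per_key: int, num_keys: int) -> float:
    """Probability that two vectors share at least one key under random-hyperplane LSH."""
    # Clamp to the valid domain of acos to guard against rounding noise.
    cosine = max(-1.0, min(1.0, cosine))
    theta = math.acos(cosine)
    # One random hyperplane puts both vectors on the same side with probability 1 - theta/pi.
    p_bit = 1.0 - theta / math.pi
    # All bits of one key must agree (AND) ...
    p_key = p_bit ** bits_per_key
    # ... and at least one of the independent keys must match (OR).
    return 1.0 - (1.0 - p_key) ** num_keys


for cos in (0.95, 0.9, 0.6, 0.3):
    p = collision_probability(cos, bits_per_key=22, num_keys=256)
    print(f"cosine={cos:.2f} -> P(at least one shared key)={p:.4f}")
```

Under these assumptions, highly similar pairs are found almost surely while dissimilar pairs rarely collide. More bits per key make the filter stricter, and more keys recover sensitivity.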

proteolizard-algorithm implements an approximation of the cosine similarity between mass spectra. To do so, it lets you generate a set of keys for m/z spectra in a vectorized representation. These keys can then be used for detection of self-collisions, reference search, or generally anything related to distance matrices, such as clustering. Key calculation is based on TensorFlow tensors and can therefore run on the GPU if you have a CUDA-enabled NVIDIA card and cuDNN is available in your environment.
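As an illustration of how integer keys can approximate cosine similarity, the sketch below uses random-hyperplane hashing in plain NumPy. It is a generic version of the technique and assumes nothing about how the library computes keys internally. The function `hyperplane_keys` and the toy vectors are hypothetical.

```python
import numpy as np


def hyperplane_keys(X: np.ndarray, trials: int, len_trial: int, seed: int = 0) -> np.ndarray:
    """Return an (n_vectors, trials) integer key matrix from random-hyperplane signs."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # One Gaussian hyperplane normal per bit: trials * len_trial in total.
    planes = rng.standard_normal((n_features, trials * len_trial))
    # A bit is 1 if the vector lies on the non-negative side of the hyperplane.
    bits = (X @ planes >= 0).astype(np.int64)
    bits = bits.reshape(X.shape[0], trials, len_trial)
    # Pack each group of len_trial bits into one integer key.
    weights = 1 << np.arange(len_trial, dtype=np.int64)
    return bits @ weights


rng = np.random.default_rng(1)
a = rng.random(200)                       # hypothetical binned spectrum
b = a + 0.01 * rng.standard_normal(200)   # near-duplicate of a
c = rng.random(200)                       # unrelated vector

K = hyperplane_keys(np.stack([a, b, c]), trials=64, len_trial=16, seed=42)
print("keys shared by a and b:", int((K[0] == K[1]).sum()))
print("keys shared by a and c:", int((K[0] == K[2]).sum()))
```

The near-duplicate pair should share far more keys than the unrelated pair. Fixing the seed makes the hyperplanes, and therefore the keys, reproducible.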

We will briefly go over how LSH is performed for timsTOF data.

TODO: explain and show workflow plot.

If you want to learn more about LSH in the context of mass spectrometry, have a look at Bob et al.¹ or Wang et al.²³

import numpy as np
import tensorflow as tf

from proteolizarddata.data import PyTimsDataHandle, TimsFrame, MzSpectrum
from proteolizardalgo.hashing import TimsHasher, IsotopeReferenceSearch, ReferencePattern
from proteolizardalgo.utility import create_reference_dict, get_refspec_list, get_ref_pattern_as_spectra

# create a data handle and read a precursor frame
dh = PyTimsDataHandle('/path/to/data.d')
frame = dh.get_frame(dh.precursor_frames[250])

# create a set of dense windows indexed by scan and mz-bin
scan, mz_bin, W = frame.get_dense_windows(window_length=4, resolution=2, min_peaks=5, 
                                          min_intensity=50, overlapping=True)

# create a spectrum hasher
# by picking a fixed seed, you can guarantee that keys can be reproduced
hasher = TimsHasher(trials=256, len_trial=22, seed=42, num_dalton=4, resolution=2)

# calculate `trials` keys for each window, each key having `len_trial` bits
K = hasher.calculate_keys(W)

print(K)

This will give you:

<tf.Tensor: shape=(10682, 512), dtype=int32, numpy=
array([[ 362167, 3700797, 3061941, ..., 1147456, 1968934,   98534],
       [2538463, 3497250, 2595794, ..., 2643667, 2048648, 3815282],
       [2003423, 3821990, 2528830, ..., 1697390, 1763353, 1735530],
       ...,
       [2898374, 1166177, 1438584, ..., 2115578,  769518,  448939],
       [1382299, 3202454, 3824606, ..., 2843920, 1615614, 3689973],
       [ 877019, 3258715, 4001803, ..., 1603336, 2742681, 2790119]],
      dtype=int32)>

where shape = (number_windows, number_keys_per_window).
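A key matrix of this shape can be searched for self-collisions by bucketing rows on their key, column by column. The following is a minimal sketch on a made-up toy matrix. `colliding_pairs` is a hypothetical helper and not part of proteolizard-algorithm. It assumes that keys are only comparable within the same column, and that a real run would pass something like `K.numpy()`.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np


def colliding_pairs(keys: np.ndarray) -> set:
    """Return index pairs (i, j), i < j, whose rows share a key in at least one column."""
    pairs = set()
    for col in range(keys.shape[1]):
        buckets = defaultdict(list)
        # Rows with the same key in this column fall into the same bucket.
        for row, key in enumerate(keys[:, col].tolist()):
            buckets[key].append(row)
        # Every pair of rows inside one bucket is a candidate match.
        for rows in buckets.values():
            pairs.update(combinations(rows, 2))
    return pairs


# Toy key matrix standing in for K.numpy(); the values are made up.
toy_keys = np.array([
    [11, 52, 93],
    [11, 40, 77],   # shares key 11 with row 0 in column 0
    [30, 41, 93],   # shares key 93 with row 0 in column 2
    [31, 42, 78],   # no collision
])
print(sorted(colliding_pairs(toy_keys)))
```

Bucketing touches each key once, so the cost grows with the number of keys plus the number of candidate pairs, rather than with all pairwise comparisons. Very large buckets still produce many pairs and may need extra filtering.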

Clustering

DUMMY

Supervised (Deep) Learning

Zohora et al.⁴⁵

1. Locality-sensitive hashing enables efficient and scalable signal classification in high-throughput mass spectrometry raw data. BMC Bioinformatics, 2022. https://doi.org/10.1186/s12859-022-04833-5
2. A Fast and Memory-Efficient Spectral Library Search Algorithm Using Locality-Sensitive Hashing. Proteomics, 2020. https://doi.org/10.1002/pmic.202000002
3. msCRUSH: Fast Tandem Mass Spectral Clustering Using Locality Sensitive Hashing. Journal of Proteome Research, 2019. https://pubs.acs.org/doi/10.1021/acs.jproteome.8b00448
4. DeepIso: A Deep Learning Model for Peptide Feature Detection from LC-MS map. Scientific Reports, 2019. https://doi.org/10.1038/s41598-019-52954-4
5. Deep neural network for detecting arbitrary precision peptide features through attention based segmentation. Scientific Reports, 2021. https://doi.org/10.1038/s41598-021-97669-7
