A simulation based framework to estimate the evolutionary traceability of protein.

BIONF/protTrace

protTrace - A simulation based framework to estimate the evolutionary traceability of protein.

language: Python | presented at: GCB2018 | published in: BioRxiv | license: GPL-3.0

Scientific context

ProtTrace is a simulation-based approach to assess, for a given protein (the seed), over what evolutionary distances its orthologs can be found by means of significant sequence similarity. By doing so, it helps to differentiate between the true absence of an ortholog in a given species and its non-detection due to limited search sensitivity. ProtTrace was presented in 2018 at the German Conference on Bioinformatics (GCB). The high resolution PDF of the corresponding poster is available from HERE.

Workflow

The workflow of protTrace to infer the evolutionary traceability of a seed protein is shown in the figure below. It consists of three main steps:

  1. Parameterization: The compilation of an orthologous group for the seed protein. In the standard setting, OMA orthologous groups are used. The sequences in the orthologous group are then used to infer the parameters of the substitution process and of the insertion and deletion process.
  2. Traceability calculation: The in-silico evolution of the seed protein using the simulation software REvolver, and the determination of the traceability curve.
  3. Visualization: The inference of the traceability index for the protein in 233 species from all domains of life, and the generation of a colored tree. A high resolution PDF of the image is available HERE.
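To make the idea of a traceability curve more tangible, here is a deliberately simplified toy sketch. It is not how protTrace or REvolver work internally: substitutions are drawn uniformly at random, insertions and deletions are ignored, and a crude sequence-identity threshold stands in for a real significance test. All function names, parameters and default values are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def evolve(seq, sub_prob, rng):
    """Replace each residue by a random amino acid with probability sub_prob."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < sub_prob else aa for aa in seq
    )


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / float(len(a))


def traceability_curve(seed, sub_prob, steps, replicates=50, threshold=0.3, rng_seed=0):
    """Fraction of simulated descendants still 'detectable' after each step.

    'Detectable' is an assumption of this toy model: identity to the seed
    at or above the given threshold.
    """
    rng = random.Random(rng_seed)
    lineages = [seed] * replicates
    curve = []
    for _ in range(steps):
        # Every replicate lineage accumulates one more round of substitutions.
        lineages = [evolve(s, sub_prob, rng) for s in lineages]
        detected = sum(identity(seed, s) >= threshold for s in lineages)
        curve.append(detected / float(replicates))
    return curve
```

Calling `traceability_curve(seed, 0.1, 20)` returns one value per simulated time step; in this toy model, the point at which the values fall away from 1.0 marks the distance beyond which the seed can no longer be reliably recognized.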

Installation & Usage

Please refer to the protTrace WIKI for a full description of the installation and usage guidelines. The WIKI also explains how to set up a virtual machine running protTrace. Below, we provide a quick excerpt.

protTrace is written in Python 2.7, with some helper scripts in Perl and R. To run protTrace you need the following resources:

  • Python v2.7.13 or higher (2.x only; protTrace will not run under Python 3)
  • Perl v5 or higher including the following modules
    • Getopt::Long
    • List::Util
    • LWP::Simple
  • Java v1.7 or higher
  • R v3 or higher
  • wget

protTrace & Accessory Software

Program name | Version | Description | Mandatory | BioConda
MAFFT | v6 or higher | Multiple sequence alignment | yes | yes
NCBI Blast | v2.7 or higher | Sequence similarity based search | yes | yes
HMMER | 3.2 or higher | Sequence similarity based search using Hidden Markov Models | yes | yes
IQTREE | 1.6.7.1 or higher | Phylogenetic tree reconstruction | yes | yes
HaMStR OneSeq | v1 or higher | Targeted ortholog search | no | no

To start with, we suggest omitting the optional HaMStR, since using this software comes with some strict naming conventions.
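Before configuring protTrace, it can help to confirm that the required programs are reachable from the command line. The following is a minimal sketch of such a check. The executable names in `REQUIRED_TOOLS` are assumptions and may differ on your system (for example, IQ-TREE binaries are named differently across versions). Note that `shutil.which` requires Python 3.3 or later, so this helper runs separately from protTrace's own Python 2.7 interpreter.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED_TOOLS = [
    "perl", "java", "Rscript", "wget",
    "mafft", "blastp", "hmmscan", "iqtree",
]


def find_missing(tools, which=shutil.which):
    """Return the tools that cannot be located on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

This only checks that the programs can be found, not that their versions satisfy the minimum requirements listed in the table.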

Once the dependencies are installed (we suggest using the conda package management system for this), you can clone this repository to get a copy of protTrace.

git clone https://github.com/BIONF/protTrace

Configuring protTrace

To configure protTrace, move into the protTrace directory and run the configure script:

perl bin/create_conf.pl -name=prog.conf -getOMA -getPfam

This will check whether all dependencies are installed, let you set all parameters required for the protTrace run, and finally download the required data from the OMA database and from the Pfam database.

  • If you are confident that you already have this data available, you can omit either or both of the options -getOMA and -getPfam. You will then have to tell protTrace via the create_conf.pl script where this data is located.
  • Make sure to adhere to the formatting requirements for the OMA data, and that you have run hmmpress on the Pfam database.
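If you supply your own Pfam database, a quick way to confirm that hmmpress has been run is to look for the four binary index files it writes next to the HMM file (`.h3f`, `.h3i`, `.h3m`, `.h3p`). The sketch below does only that. The function name and the example path `Pfam-A.hmm` are hypothetical and not part of protTrace.

```python
import os

# hmmpress writes these four files alongside the HMM database.
HMMPRESS_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_hmmpress_files(hmm_path):
    """Return the expected hmmpress output files that do not exist."""
    return [
        hmm_path + suffix
        for suffix in HMMPRESS_SUFFIXES
        if not os.path.isfile(hmm_path + suffix)
    ]
```

For example, `missing_hmmpress_files("Pfam-A.hmm")` returns an empty list once the database has been pressed, and otherwise lists the files that are still missing.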

Once everything is set, you are ready to run protTrace.

Calling protTrace

Enter the protTrace directory and type

python bin/protTrace.py -h

This should print:

USAGE:  protTrace.py -i <omaIdsFile> | -f <fastaSeqsFile> -c <configFile> [-h]
        -i              Text file containing protein OMA ids (1 id per line)
        -f              List of input protein sequences in fasta format
        -c              Configuration file for setting program's dependencies
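When protTrace is called from another script, the command line can be assembled programmatically. The following sketch builds the argument list from the options shown in the usage message, treating -i and -f as alternatives. The function `build_command` and its parameter names are hypothetical helpers, not part of protTrace, and the default interpreter name `python` is assumed to point to a Python 2.7 installation.

```python
def build_command(config, ids_file=None, fasta_file=None,
                  script="bin/protTrace.py", python="python"):
    """Assemble a protTrace call as an argument list.

    Exactly one of ids_file (-i) or fasta_file (-f) must be given.
    """
    if (ids_file is None) == (fasta_file is None):
        raise ValueError("give exactly one of ids_file or fasta_file")
    cmd = [python, script]
    if ids_file is not None:
        cmd += ["-i", ids_file]
    else:
        cmd += ["-f", fasta_file]
    cmd += ["-c", config]
    return cmd
```

The returned list can be passed directly to `subprocess.call`, which avoids shell quoting problems with file names that contain spaces.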

Input Data

protTrace can use either OMA protein ids or protein sequences in fasta format as input.

In toy_example/ you can find two files, test.ids and test.fasta for performing a test run with protTrace.

We describe the input in the section Test Run of our WIKI.
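As a rough illustration of the two input types, the sketch below reads an id list (one id per line) and a fasta file into simple Python structures. It is a generic reader for checking your own input files before a run, not the parser that protTrace itself uses, and the function names are hypothetical.

```python
def read_ids(lines):
    """Return the non-empty, whitespace-stripped lines of an id file."""
    return [line.strip() for line in lines if line.strip()]


def read_fasta(lines):
    """Parse fasta-formatted lines into a dict mapping header to sequence."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; drop the leading '>' from the header.
            header = line[1:]
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequences may be wrapped over several lines.
            records[header] += line
    return records
```

Both functions accept any iterable of lines, so an open file handle can be passed in directly, for example `read_fasta(open("test.fasta"))`.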

Test Run

In the directory toy_example we provide two files for testing protTrace:

  1. test.ids: This file contains the OMA protein id of the yeast protein DIM1. To run this test:
    1. create a config file prot.conf using the create_conf.pl script. We recommend leaving all values at their defaults to start with
    2. place the config file into the directory toy_example
    3. enter the directory toy_example and run protTrace by typing
    python ../bin/protTrace.py -i test.ids -c prot.conf
    
    The output that will be generated by this run is described in the WIKI
  2. test.fasta: This file contains the protein sequence of human ZNT3. To run this test:
    1. create or modify the config file prot.conf using the create_conf.pl script. Make sure to set the entry species to HUMAN in the section General Options
    2. place the config file into the directory toy_example
    3. enter the directory toy_example and run protTrace by typing
    python ../bin/protTrace.py -f test.fasta -c prot.conf
    
    The output that will be generated by this run is described in the WIKI

WIKI

Read the WIKI to explore the functionality of protTrace.

Bugs

Bug reports, comments, and suggestions are highly appreciated. Please open an issue on GitHub or get in touch via email.

Acknowledgements

We would like to thank the members of the Ebersberger group for many valuable suggestions and ...bug reports :)

License

This tool is released under the GNU GPL-3.0 license.

How-To Cite

Arpit Jain, Arndt von Haeseler, Ingo Ebersberger. The evolutionary Traceability of protein (2018). BioRxiv.

Contact

Ingo Ebersberger ebersberger@bio.uni-frankfurt.de
