Skip to content

Latest commit

 

History

History
73 lines (55 loc) · 3.51 KB

README.md

File metadata and controls

73 lines (55 loc) · 3.51 KB

Introduction

Genotate is a tool to annotate prokaryotic and phage genomes. It uses scrolling amino-acid windows in all six frames to distinguish between windows that belong to protein coding gene regions and those that belong to noncoding regions, in order to determine the coding frame at every position along the genome.

  • Unlike every other currently available gene caller, Genotate does not rely on start and stop codons in order to predict coding genes

To install Genotate,

 pip install genotate

And to run Genotate you only need to specify the FASTA formatted genome file The command to run using the phage models on the provided phiX174 genome is:

 genotate.py test/phiX174.fasta -o predictions.gb

The command to run using the partially trained bacterial/archaeal models needs the --bacterial flag. Instead of a FASTA formatted file, you can provide a Genbank formatted file and Genotate will use only the genomic sequence.

 genotate.py test/mycoplasma.gbff.gz -o predictions.gb --bacteria

It is recommended to use a GPU to run Genotate since it will take a long time to run prokaryotic genomes. Genotate will automatically try to run on GPU, if one isn't found it will run on a CPU.


† The output of Genotate are 'coding region' predictions in GenBank format. They should match with the true coding gene regions, but are not genes per say, since they are not based on start and stop codons. Though they have all been trimmed to a stop codon after Genotate determines which transation table the genome uses (i.e. whether it performs stop codon readthrough).

There are three main phases to the Genotate workflow

  1. window classification
  2. change-point detection
  3. refinement
    • analyze stop codons
    • merge adjacent regions
    • split regions on stop
    • adjust ends to a stop

Genotate determines the translation table by analyzing the initial coding gene region predictions. There are two outcomes for a stop codon that is readthrough: either the stop codon appears in the middle of a coding gene region or the region is broken into two pieces at the stop codon. If one of the three known stop codons is significantly over represented in the middle AND between predicted gene regions, that stop codon can be assumed to be read through. With the stop codon usage now known, adjacent coding regions that are in the same frame are merged if there is not a stop codon between them. Then the regions are split on any internal stop codons and the ends adjusted to the nearest stop codon.

** The end opposite the stop codon is not adjusted to a valid start codon since Genotate does not (yet) have a translation initiation site detection method yet, so the beginning of a gene call may be off by a few codons

Currently the best way to visualize the predictions is in a Genome Viewer application, such as Artemis by Sanger. The example phiX174.gb GenBank file loaded into Artemis shows the gene layout:

The Genotate gene calls in the output predictions.gb file can then be loaded using the 'File>Read An Entry' menu, and the predictions will be overlaid as grey 'coding regions' in the gene layout window:

The fact that Genotate calls 9 out of the 10 known coding genes of phiX174, including the fully nested genes B and K, shows just how unrivaled Genotate is among currently available gene callers.