Jaime Huerta-Cepas edited this page Aug 27, 2020 · 2 revisions

Course Presentation

Welcome to the hands-on ETE Toolkit practical course!

This document and the attached repository provide a practical guided tour of the most important features of the ETE Toolkit. As background, we use a fictional biological scenario, so all examples and exercises are motivated by biological questions that can be addressed from a phylogenomic point of view.

The full exercise is divided into several sections and subsections, each representing a common phylogenomic question and covering different aspects of the ETE Toolkit. A notebook with the code and commented solutions for each section is provided.

IMPORTANT: The data used in this course has been manually manipulated to ensure the expected results. Do not use it for real work!

Full Exercise

After many years of work, your lab has just isolated and sequenced a very interesting strain of the bacterium Aquifex aeolicus. This strain shows remarkable resistance to sulfur-rich environments. To investigate it further, you decide to carry out an in-depth phylogenomic study in which your strain is analyzed in the context of other known sulfur-related organisms and reference species.

How many bacteria and archaea are included in our set? Can you obtain a taxonomy tree relating them?

Reconstruct the evolutionary history of all Aquifex genes.

This task should not require any interaction with the ETE Toolkit, but it is provided here for consistency. Basically, we aim to produce gene family clusters (in FASTA format) from the whole set of genomes considered. A quick way to achieve this goal is to use a BLAST-based clustering of all sequences in the target genomes. For simplicity, we will use a basic MMseqs2 pipeline.
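
As a rough illustration of how a clustering result can be turned into gene families, the sketch below groups sequence IDs from a two-column, tab-separated table (cluster representative, member), which is the layout MMseqs2 uses for its cluster TSV output. The function name `read_clusters` and the toy sequence IDs are made up for illustration, and this is not necessarily how the course notebook handles this step.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle):
    """Group member IDs by cluster representative from a two-column TSV."""
    families = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        representative, member = row[0], row[1]
        families[representative].append(member)
    return dict(families)


# Toy input standing in for a real clustering table (IDs are hypothetical).
example_tsv = "geneA\tgeneA\ngeneA\tgeneB\ngeneC\tgeneC\n"
families = read_clusters(io.StringIO(example_tsv))
print(families)
```

With a real file, the same function could be called on an open file handle instead of the in-memory string; each resulting list of members would then be written out as one FASTA file per family.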

The resulting families needed for the exercise are already provided at XXXXX, but the following notebook reproduces the building steps.

Identify good marker families (single-copy genes present in all species)
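
The selection criterion in parentheses can be sketched in a few lines of plain Python: keep a family only if every species contributes exactly one sequence. The sketch assumes that each sequence ID starts with a species code followed by a `|` separator; that naming scheme, the function names, and the toy families are assumptions for illustration, not taken from the course data.

```python
from collections import Counter


def species_of(seq_id, sep="|"):
    """Return the species code, assumed to be the prefix before `sep`."""
    return seq_id.split(sep, 1)[0]


def single_copy_families(families, all_species, sep="|"):
    """Keep families where every species appears exactly once."""
    expected = set(all_species)
    markers = {}
    for name, members in families.items():
        counts = Counter(species_of(m, sep) for m in members)
        present_in_all = set(counts) == expected
        single_copy = all(n == 1 for n in counts.values())
        if present_in_all and single_copy:
            markers[name] = members
    return markers


# Hypothetical toy families: fam2 has a duplication, fam3 lacks sp3.
families = {
    "fam1": ["sp1|g1", "sp2|g7", "sp3|g4"],
    "fam2": ["sp1|g2", "sp1|g3", "sp2|g8", "sp3|g5"],
    "fam3": ["sp1|g9", "sp2|g6"],
}
markers = single_copy_families(families, {"sp1", "sp2", "sp3"})
print(sorted(markers))
```

In this toy input only `fam1` passes both checks. In practice the strict criterion is sometimes relaxed (for example, tolerating a few missing species), but that is a separate design choice not covered here.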