HATCHet is an algorithm to infer allele and clone-specific copy-number aberrations (CNAs), clone proportions, and whole-genome duplications (WGD) for several tumor clones jointly from multiple bulk-tumor samples of the same patient or from a single bulk-tumor sample. HATCHet has been designed and developped by Simone Zaccaria in the group of prof. Ben Raphael at Princeton University. The full description of the algorithm, the comparison with previous state-of-the-art methods, and its application on published cancer datasets are described in
Simone Zaccaria and Ben Raphael, 2018
The novel simulationg framework, MASCoTE, is available at
The simulated data, the results of all the methods considered in the comparison, the results of HATCHet on the published whole-genome multi-sample tumor sequencing datasets ((Gundem et al., Nature, 2015) and (Makohon-Moore et al., Nature genetics, 2017)) and all the related analyses are available at
This repository includes a quick overview of the HATCHet's algorithm and software, detailed instructions for installation and requirements, a script to execute the full pipeline of HATCHet, demos to illustrate how to apply HATCHet on different datasets with different features, a list of current issues, and contacts.
Overview of HATCHet algorithm.
- HATCHet analyzes the read-depth ratio (RDR) and the B-allele frequency (BAF) in bins of the reference genome (black squares) jointly from multiple tumor samples. Here, we show two tumor samples p and q.
- HATCHet globally clusters the bins based on RDR and BAF along the entire genome and jointly across samples p and q. Each cluster (color) includes bins with the same copy-number state within each clone present in p or q.
- HATCHet estimates the fractional copy number of each cluster. If there is no WGD, the identification of the cluster (magenta) with copy-number state (1, 1) is sufficient and RDRs are scaled correspondingly. If a WGD occurs, HATCHet finds the cluster with copy-number state (2, 2) (same magenta cluster) and a second cluster having an identical copy-number state in all tumor clones.
- HATCHet factorizes the allele-specific fractional copy numbers F^A, F^B into the allele-specific copy numbers A, B, respectively, and the clone proportions U. Here there is a normal clone and 3 tumor clones.
- HATCHet's model selection criterion identifies the matrices A, B and U in the factorization while evaluating the fit according to both the inferred number of clones and presence/absence of a WGD.
- Clusters are classified by their inferred copy-number states in each sample. Sample-clonal clusters have a unique copy-number state in the sample and correspond to evenly-spaced positions in the scaled RDR-BAF plot (vertical grid lines in each plot). Sample-subclonal clusters (e.g. cyan in p) have different copy-number states in a sample and thus correspond to intermediate positions in the scaled RDR-BAF plot. Tumor-clonal clusters have identical copy-number states in all tumor clones -- thus they are sample-clonal clusters in every sample and preserve their relative positions in scaled-RDR-BAF plots. In contrast, tumor-subclonal clusters have different copy-number states in different tumor clones and their relative positions in the scaled RDR-BAF plot varies across samples (e.g. purple cluster).
The current implementation of HATCHet is composed of two sets of modules:
(1) The core modules of HATCHet are designed to efficiently solve a challenging constrained and distance-based simultaneous matrix factorization which aim to infer allele and clone-specific copy numbers and clone proportins from fractional copy numbers. The module is implemented in C++11 and are included in src
folder.
(2) The utility modules of HATCHet perform several different tasks that are needed to process the raw data, perform steps of the HATCHet's algorithm needed for the factorization, and process the results. These task include reading/calling germinal single-point mutations, counting reads from a BAM file, combining the read counts and other information, segmenting through HATCHet's global approach, plotting very useful information, etc. These modules are implemented in python 3 and are available as the util
and bin
submodules.
The setup process is composed of 3 steps:
- Installation: the compilation and installation of Hatchet and its dependencies; this is a one-time process.
- Using Gurobi: the requirements to run Gurobi; these need to be satisfied before every run.
- Required data: the requirements for considered data; these need to be satisfied whenever using new data.
HATCHet
can be installed using an existing installation of conda
(e.g. either the compact
Miniconda or the complete Anaconda).
First, bioconda
requires to set the following channels in this exact order:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
Next, HATCHet
can be installed with the following one-time command:
conda install hatchet
This command installs hatchet
in the base conda environment, however best practices generally recommend, especially in computing servers, to keep distinct packages independent.
Therefore, it is preferable to use the following command to install hatchet
:
conda create -n hatchet hatchet
In this case, hatchet
must be activated before every session with the following command
source activate hatchet
If you wish to install HATCHet
directly from this repository, the steps are a bit more involved. Please refer to the
Manual Installation document for more details.
Every run of HATCHet (especially, the hatchet
step) needs to use Gurobi which requires a valid license pointed by the environmental variable GRB_LICENSE_FILE
. This can be easily obtained depending on the type of free academic license available:
-
Individual license. This license can be obtained easily by any academic user with an institutional email. This license is user and machine-specific, meaning that the user needs to require a different license for every used machine. Assuming the license is stored at
/path/to/gurobi.lic
, set the environment variableGRB_LICENSE_FILE
to point to it:export GRB_LICENSE_FILE="/path/to/gurobi.lic"
-
Multi-use license. This license can be used by multiple users on any machine in a cluster. This license can be obtained but needs to be requested by the IT staff of the user's institution. This license is typically used in a machine cluster and only requires the following command (the exact form of this command may vary):
module load gurobi
HATCHet requires 3 input data files:
-
One or more BAM files containing DNA sequencing reads obtained from tumor samples of a single patient. Every BAM file contains the sequencing reads from a sample ad needs to be indexed and sorted. For example, each BAM file can be easily indexed and sorted using SAMtools. In addition, one can improve the quality of the data by processing the BAM files according to the GATK Best Practices.
-
A BAM file containg DNA sequencing reads obtained from a matched-normal sample of the same patient of the considered tumor samples. The BAM file needs to be indexed and sorted as the tumor BAM files. Also, the BAM files can be processed as the tumor BAM files.
-
A human reference genome. Ideally, one should consider the same human reference genome used to align the sequencing reads in the given BAM files. The most-used human reference genome are available at GRC or UCSC. Observe that human reference genomes use two different notations for chromosomes: either
1, 2, 3, 4, 5 ...
orchr1, chr2, chr3, chr4, chr5 ...
. One needs to make sure all BAM files and reference genome share that same chromosome notation. When this is not the case, one needs to change the reference to guarantee consistency and needs to re-index the new reference (e.g. using SAMtools). Also, HATCHet requires that the name of each chromosome is the first word in each ID such that>1 [ANYTHING] ... \n>2 [ANYTHING] ... \n>3 [ANYTHING] ...
or>chr1 [ANYTHING] ... \n>chr2 [ANYTHING] ... \n>chr3 [ANYTHING]
.For the reference genome, HATCHet requires the existence of a a sequence dictionary (
.dict
), which is part of all standard pipelines for sequencing data, see for example GATK or Galaxy. Please note that the sequence dictionary is NOT the reference index.fai
, which is a different structure, has a different function, and it is also recommended.The dictionary of a reference genome is often included in the available bundles for the reference genomes, see the example for hg19 from Broad Institute. However, the dictionary can also be generated in seconds using either SAMtools or Picard tools.
In the folder where you want to download and index the human genome, the steps would typically be:
curl -L https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz | gzip -d > hg19.fa samtools faidx hg19.fa samtools dict hg19.fa > hg19.dict
The repository includes all the components that are required to cover every step of the entire HATCHet's pipeline, starting from the processing of raw data reported in a BAM file through the analysis of the final results. We provide:
-
A script representing the full pipeline of HATCHet, and we describe in details the whole script through a tutorial with instructions for usage.
-
Demos that correspond to guided executions of HATCHet on some examples, and explain in detail the usage of HATCHet when considering standard datasets, real datasets with high noise, and different kind of data.
-
Custom pipelines which adapt the full HATCHet's pipeline to special conditions or integrates pre-processed data belonging to different pipelines.
The implementation of HATCHet is highly modular and one can replace any HATCHet's module with any other method to obtain the required results (especially for the pre-processing modules). As such, we also provide here an overview of the entire pipeline and we describe the details of each step in a dedicated section of the manual.
-
Recommendations, especially for noisy datasets or with different features, to guide the user in the interpretation of HATCHet's inference. We explain how to perform quality control to guarantee the best-quality results, and describe how the user can control and tune some of the parameters to obtain the best-fitting results.
HATCHet is in active development, please report any issue or question as this could help the devolment and imporvement of HATCHet. Current known issues with current version are reported here below:
- The allele-swapping feature of comBBo has been temporarily disabled due to conflicts with recent SAMtools versions.
- HATCHet has not been tested on Windows yet.
A list of the major recent updates:
- HATCHet accepts now any reference genome (including non humans).
- HATCHet now handles genomic regions properly, recommended for WES data.
HATCHet is actively maintained by Simone Zaccaria, currently a post-doctoral research associate at Princeton University in the research group of prof. Ben Raphael.