Hotspot Missense mutation Areas in Protein Structures (HotMAPS) detects 3D somatic mutation hotspots in PDB structures. Identical chains are accounted for in the model to prevent errors induced by assuming the chains are independent. For example, a mutation annotated in a gene that forms a homodimer will always appear at the same position in both chains. PDB biological assemblies are used when available to best reveal biologically meaningful 3D hotspots.
Running HotMAPS is generally done through make commands that invoke scripts found in the scripts directory or at the top level. HotMAPS is intended to be run on Linux operating systems.
The Python code depends on the following Python libraries (listed in requirements.txt):
- numpy
- scipy
- biopython
The necessary Python libraries can be installed through pip:
$ pip install -r requirements.txt
The exact version numbers may not be strictly necessary, but they are the versions HotMAPS was tested with.
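If you want to keep the dependencies isolated from your system Python, a virtual environment works as well. This is an optional sketch and not part of the documented setup; the environment name is arbitrary, and on Python 2 you would use the virtualenv tool instead of the venv module:
$ python -m venv hotmaps_env   # environment name is arbitrary
$ source hotmaps_env/bin/activate
$ pip install -r requirements.txt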
For consistency between the protein chain descriptions found in the PDB files and those used in MuPIT, we use BioJava to annotate each chain and add that label as a column in the input file to our pipeline. Download the BioJava 3.1 core and structure jar files and place them in a directory called "lib".
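As a minimal sketch of that directory setup (the jar file names below are illustrative assumptions; use the file names that come with the BioJava 3.1 release you download):
$ mkdir -p lib
$ cp /path/to/biojava3-core-3.1.0.jar lib/       # jar name is an assumption
$ cp /path/to/biojava3-structure-3.1.0.jar lib/  # jar name is an assumption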
First, download the Protein Data Bank (PDB) structures from ftp://ftp.wwpdb.org/pub/pdb/ and the theoretical protein structure models (ftp://salilab.org/databases/modbase/projects/genomes/H_sapiens/2013/). You will need both the Ensembl and RefSeq theoretical protein structure models (H_sapiens_2013.tar.xz and ModBase_H_sapiens_2013_refseq.tar.xz, respectively). Then update config.txt to point to the directories where you extracted the structure files (see the example config entries after the commands below). Additionally, download the mutations file, protein structure annotation file, and annotations for the CRAVAT reference transcript available here. Place all three files in a sub-directory called "data". Assuming you are already in the HotMAPS directory:
$ mkdir -p data
$ cd data
$ wget http://karchinlab.org/data/HotMAPS/mutations.txt.gz
$ gunzip mutations.txt.gz
$ wget http://karchinlab.org/data/HotMAPS/pdb_info.txt.gz
$ gunzip pdb_info.txt.gz
$ wget http://karchinlab.org/data/HotMAPS/mupit_annotations.tar.gz
$ tar xvzf mupit_annotations.tar.gz
$ cd ..
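For reference, the structure-directory entries in config.txt might look roughly like the following. The key names here are illustrative assumptions; keep the keys that already exist in the config.txt shipped with the repository and only change the paths:
# hypothetical example -- use the actual key names present in config.txt
pdb_dir = /path/to/extracted/pdb
modbase_dir = /path/to/extracted/H_sapiens_2013
refseq_modbase_dir = /path/to/extracted/ModBase_H_sapiens_2013_refseq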
Assuming you have updated the config.txt file to point to where you downloaded the protein structure files, an additional step is needed to annotate those protein structures.
$ make annotateStructures
To run the code in parallel using Sun Grid Engine (SGE) execute the following make command:
$ make OUTPUT_DIR=myoutput_dir runParallelHotspot
To run the code normally (no parallelization) execute:
$ make OUTPUT_DIR=myoutput_dir runNormalHotspot
Here, myoutput_dir is the output directory (Default: output/all_pdb_run).
Note that if you ran the normal version instead of the parallel version, you do not need to run this next step, as the merged file will already have been produced. To merge the output from the parallel runs, use the following make command:
$ make OUTPUT_DIR=myoutput_dir mergeHotspotFiles
Next, the p-values need to be adjusted for multiple hypothesis testing. This step needs the CRAVAT reference transcript files noted in the Initial Setup section that were saved in the "data" sub-directory (parameter MUPIT_ANNOTATION_DIR in the make command).
$ make multipleTestCorrect OUTPUT_DIR=myoutput_dir MUPIT_ANNOTATION_DIR=annotation_dir Q_VALUE=myqvalue
Here, myqvalue is the q-value for the False Discovery Rate (FDR) correction (.01 by default). The next step is to group significant residues into regions. If you are interested in regions on the actual PDB protein structure, use the following command:
$ make findHotregionStruct OUTPUT_DIR=myoutput_dir Q_VALUE=myqvalue MUPIT_ANNOTATION_DIR=annotation_dir
As before, myoutput_dir is the output directory and myqvalue is the q-value (Default: .01). Similarly, regions can be constructed for each gene using the reference transcript selected by CRAVAT for each mutation.
$ make findHotregionGene OUTPUT_DIR=myoutput_dir Q_VALUE=myqvalue MUPIT_ANNOTATION_DIR=annotation_dir
This is very similar to the 3D detection workflow, but the 1D versions of the make commands are used instead of the 3D versions. If you have an SGE cluster, the code can be executed in parallel:
$ make runParallelHotspot1D OUTPUT_DIR=myoutput_dir
If not, then run the non-parallel version:
$ make runNormalHotspot1D OUTPUT_DIR=myoutput_dir
The next merging step is needed only for the parallel execution; otherwise, continue without running it.
$ make mergeHotspotFiles1D OUTPUT_DIR=myoutput_dir
Perform multiple testing correction on the p-values.
$ make multipleTestCorrect1D OUTPUT_DIR=myoutput_dir MUPIT_ANNOTATION_DIR=annotation_dir Q_VALUE=myqvalue
Construct hotspot regions for each structure.
$ make findHotregionStruct1D OUTPUT_DIR=myoutput_dir Q_VALUE=myqvalue MUPIT_ANNOTATION_DIR=annotation_dir
Construct merged hotspot regions, using evidence from all PDB structures, to form the gene-level regions.
$ make findHotregionGene1D OUTPUT_DIR=myoutput_dir Q_VALUE=myqvalue MUPIT_ANNOTATION_DIR=annotation_dir
The input data for HotMAPS can also be prepared by directly using the MuPIT MySQL database. A MySQL dump of the MuPIT database containing the mutation counts used in our study, together with the associated tables that map genome coordinates to PDB structures, is available here. The dump is fairly large, so you may want to download it and load it directly into MySQL.
$ wget http://karchinlab.org/data/HotMAPS/mupit_modbase.sql.gz
$ gunzip mupit_modbase.sql.gz
$ mysql [options] < mupit_modbase.sql
This will create a database named mupit_modbase, where [options] are the MySQL parameters needed to log in.
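For example, a typical invocation might look like the following, where the user name and host are placeholders for your own MySQL credentials:
$ mysql -u myuser -p -h myhost < mupit_modbase.sql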
Next, the input files need to be generated before starting the Running 3D HotMAPS section. The initial input information is retrieved from the MuPIT MySQL database, in contrast to downloading the ready-made files (as done in the Initial Setup section). To prepare the input files, simply invoke the following make command:
$ make MYSQL_USER=myuser MYSQL_HOST=myhost MYSQL_DB=mydb prepareHotspotInput
Here, myuser is your MySQL user name, myhost is the host name for MySQL, and mydb is the database name for MuPIT (Default: mupit_modbase).
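As a concrete illustration with placeholder values (adjust to match your own MySQL setup):
$ make MYSQL_USER=myuser MYSQL_HOST=localhost MYSQL_DB=mupit_modbase prepareHotspotInput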