Tutorial

martineh edited this page Sep 30, 2014 · 21 revisions

HPG Aligner provides two commands to build index files for the reference genome. Index creation is designed to be fast and uses multiple cores to speed up the process. Depending on the size of the reference genome, index creation may need a lot of memory. The commands to build the index are:

  • build-sa-index command: to create the index based on suffix arrays (SA)
  • build-bwt-index command: to create the index based on the Burrows-Wheeler Transform (BWT)

HPG Aligner also provides two commands to align reads to the reference genome, depending on the input sequences:

  • dna command: to map DNA sequences
  • rna command: to map RNA-seq

The following sections describe the parameters used by each command.

The build-sa-index command

./hpg-aligner build-sa-index

The command line options for the build-sa-index command are:

-g, --ref-genome=<file>       Reference genome
-i, --bwt-index=<directory>   SA directory name
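A minimal sketch of a typical invocation, assembled from the two options listed above. The file name genome.fa and the directory name sa-index are placeholders chosen for illustration, not names the tool requires. The command is printed first and only executed if an hpg-aligner binary is present in the current directory.

```shell
# Placeholder paths (assumptions): replace with your own files.
ref_genome="genome.fa"
index_dir="sa-index"

# Assemble the command from the documented options.
cmd="./hpg-aligner build-sa-index --ref-genome=$ref_genome --bwt-index=$index_dir"
echo "$cmd"

# Run only when the binary is available in the current directory.
if [ -x ./hpg-aligner ]; then
    $cmd
fi
```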

The build-bwt-index command

./hpg-aligner build-bwt-index

The command line options for the build-bwt-index command are:

-g, --ref-genome=<file>       Reference genome
-i, --bwt-index=<directory>   BWT directory name
-r, --index-ratio=<int>       BWT index compression ratio
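The BWT variant takes one extra option. The sketch below shows how the three options might be combined. The paths are placeholders, and the ratio value 8 is purely illustrative: this page does not state the valid range or the default, so check the built-in help before relying on it.

```shell
# Placeholder paths and an illustrative ratio (all assumptions).
ref_genome="genome.fa"
index_dir="bwt-index"
index_ratio=8

# Assemble the command from the documented options.
cmd="./hpg-aligner build-bwt-index --ref-genome=$ref_genome --bwt-index=$index_dir --index-ratio=$index_ratio"
echo "$cmd"

# Run only when the binary is available in the current directory.
if [ -x ./hpg-aligner ]; then
    $cmd
fi
```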

The dna command

For alignment of DNA sequences:

./hpg-aligner dna

General options:

-f, --fq, --fastq=<file>                                Reads file input
-i, --bwt-index=<file>                                  Index directory name
-o, --outdir=<file>                                     Output directory
--prefix=<string>                                       File prefix name
--bam-format                                            BAM output format (otherwise, SAM format)
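As an illustration of the general options, the following sketch maps a single-end FASTQ file against a previously built index and requests BAM output. The names reads.fq, sa-index, dna-out and sample1 are placeholders, not values taken from the tool.

```shell
# Placeholder inputs and outputs (assumptions).
reads="reads.fq"
index_dir="sa-index"
out_dir="dna-out"
prefix="sample1"

# Single-end DNA mapping with BAM output instead of the default SAM.
cmd="./hpg-aligner dna --fastq=$reads --bwt-index=$index_dir --outdir=$out_dir --prefix=$prefix --bam-format"
echo "$cmd"

# Run only when the binary is available in the current directory.
if [ -x ./hpg-aligner ]; then
    $cmd
fi
```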

Pair mode options:

-j, --fq2, --fastq2=<file>                              Reads file input #2 (for paired mode)
--paired-min-distance=<int>                             Minimum distance between pairs
--paired-max-distance=<int>                             Maximum distance between pairs
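For paired-end data, the second reads file is passed alongside the first. The sketch below is one way the pair mode options might be combined. The file names and the distance bounds (100 and 800) are illustrative assumptions; suitable bounds depend on the insert size of your library.

```shell
# Placeholder mate files and illustrative distance bounds (assumptions).
reads_1="reads_1.fq"
reads_2="reads_2.fq"
min_dist=100
max_dist=800

# Paired-end DNA mapping; the index and output directories are placeholders too.
cmd="./hpg-aligner dna --fastq=$reads_1 --fastq2=$reads_2"
cmd="$cmd --bwt-index=sa-index --outdir=dna-paired-out"
cmd="$cmd --paired-min-distance=$min_dist --paired-max-distance=$max_dist"
echo "$cmd"

# Run only when the binary is available in the current directory.
if [ -x ./hpg-aligner ]; then
    $cmd
fi
```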

Report options:

--report-all                                            Report all alignments
--report-n-best=<int>                                   Report the <n> best alignments
--report-n-hits=<int>                                   Report <n> hits
--report-only-paired                                    Report only proper paired alignments
--report-best                                           Report all alignments with the best score
-l, --log-level=<int>                                   Log debug level
-h, --help                                              Help option
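The report options control how many alignments are written per read. The sketch below builds two alternative commands from a shared base: one that keeps the three best alignments and one that keeps every alignment tied for the best score. Whether several report options can be combined in one run is not stated on this page, so each command uses only one. File and directory names are placeholders.

```shell
# Shared part of the command; paths are placeholders (assumptions).
base="./hpg-aligner dna --fastq=reads.fq --bwt-index=sa-index --outdir=dna-out"

# Alternative 1: report the 3 best alignments per read (3 is illustrative).
cmd_n_best="$base --report-n-best=3"

# Alternative 2: report all alignments that share the best score.
cmd_best="$base --report-best"

echo "$cmd_n_best"
echo "$cmd_best"
```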

Seeding options:

--num-seeds=<int>                                       Number of seeds per read

Smith-Waterman options for the gap alignments:

--sw-match=<double>                                     Match value for Smith-Waterman algorithm
--sw-mismatch=<double>                                  Mismatch value for Smith-Waterman algorithm
--sw-gap-open=<double>                                  Gap open penalty for Smith-Waterman algorithm
--sw-gap-extend=<double>                                Gap extend penalty for Smith-Waterman algorithm
--sw-min-score=<double>                                 Minimum score for valid mappings

Architecture options:

--cpu-threads=<int>                                     Number of CPU threads
--read-batch-size=<int>                                 Batch size for reading
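One common way to choose a value for --cpu-threads is to match the number of online processors. The sketch below queries that number with getconf and falls back to 1 if the query fails. Using every core is an assumption about what suits your machine, not a recommendation from the tool, and the paths are placeholders.

```shell
# Number of online processors, falling back to 1 if getconf is unavailable.
threads=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Placeholder paths (assumptions); only --cpu-threads is the point here.
cmd="./hpg-aligner dna --fastq=reads.fq --bwt-index=sa-index --outdir=dna-out --cpu-threads=$threads"
echo "$cmd"

# Run only when the binary is available in the current directory.
if [ -x ./hpg-aligner ]; then
    $cmd
fi
```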

Post-processing options:

--indel-realignment                                     Indel-based re-alignment
--recalibration                                         Base quality score recalibration

The rna command

For alignment of RNA sequences:

./hpg-aligner rna

General options:

-f, --fq, --fastq=<file>                                Reads file input
-i, --bwt-index=<file>                                  BWT directory name
-o, --outdir=<file>                                     Output directory
-e, --ext=<file>                                        File extension name
--bam-format                                            BAM output format (otherwise, SAM format)

RNA-seq specific options:

--max-intron-size=<int>                                 Maximum intron size
--min-intron-size=<int>                                 Minimum intron size
--min-score=<int>                                       Minimum score for valid mappings
--transcriptome-file=<file>                             Transcriptome file to help search splice junctions
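The RNA-seq specific options can be added to a basic rna invocation as sketched below. The intron size limits (40 and 500000) are illustrative values only, and annotation.gtf is a hypothetical file name: this page does not say which annotation format --transcriptome-file expects, so the GTF extension is an assumption.

```shell
# Placeholder inputs (assumptions).
reads="rna_reads.fq"
index_dir="bwt-index"
out_dir="rna-out"
transcriptome="annotation.gtf"   # hypothetical name; format not stated on this page

# Illustrative intron size limits (assumptions).
min_intron=40
max_intron=500000

cmd="./hpg-aligner rna --fastq=$reads --bwt-index=$index_dir --outdir=$out_dir"
cmd="$cmd --min-intron-size=$min_intron --max-intron-size=$max_intron"
cmd="$cmd --transcriptome-file=$transcriptome"
echo "$cmd"

# Run only when the binary is available in the current directory.
if [ -x ./hpg-aligner ]; then
    $cmd
fi
```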

Pair mode options:

--fq2, --fastq2=<file>                                  Reads file input #2 (for paired mode)
--paired-min-distance=<int>                             Minimum distance between pairs
--paired-max-distance=<int>                             Maximum distance between pairs

Report options:

--report-all                                            Report all alignments
--report-n-best=<int>                                   Report the <n> best alignments
--report-n-hits=<int>                                   Report <n> hits
--report-only-paired                                    Report only proper paired alignments
--report-best                                           Report all alignments with the best score
-l, --log-level=<int>                                   Log debug level
-h, --help                                              Help option

Seeding options:

--seed-size=<int>                                       Number of nucleotides in a seed (only for BWT mode)
--min-cal-size=<int>                                    Minimum CAL size

Smith-Waterman options for the gap alignments:

--sw-match=<double>                                     Match value for Smith-Waterman algorithm
--sw-mismatch=<double>                                  Mismatch value for Smith-Waterman algorithm
--sw-gap-open=<double>                                  Gap open penalty for Smith-Waterman algorithm
--sw-gap-extend=<double>                                Gap extend penalty for Smith-Waterman algorithm

Architecture options:

--cpu-threads=<int>                                     Number of CPU threads
--read-batch-size=<int>                                 Batch size for reading