STIX (Structural Variant Index) supports searching every discordant paired-end and split-read alignment from thousands of sample BAMs or CRAMs for the existence of an arbitrary SV. STIX reports a per-sample count of all concurring evidence. From these counts we can, for example, conclude that an SV with high-level evidence in many samples is common and an SV with no evidence is rare.
usage: stix <options>
options:
-i index directory
-s slop
-P padding base piars for query insertion(default 50)
-p PED file
-B Sharding file
-Q Batch queries in table format(left \t right \t len \t SVtype \t ID)
-T Threshold to define STIX_ONE(How many supporting reads define a hit) (default:1)
-c Alt file column (default 1)
-d PED database file
-r right SV region
-l left SV region
-f VCF file
-a List of columns to aggregate over
-F Filter samples by PED field query
-j JSON output
-t SV type (DEL,INS,INV,DUP,BND)
-L Length of Insertion(use it when -t INS)
-R Relative Erorr Threshold to compare the length of query INS and targeted INS (0.0-1.0) (default:0.15)
-v Add sample depth to VCF file
-S Give only summary
-D Give sample depth array
-V Verbose mode(print debug information, will increase output file size greatly)
This repository contains all related tools for using STIX. It includes:
- STIX (main program)
- Giggle (tool for indexing)
- excord (SV singals extraction for short-read)
- excordlr (SV singals extraction for long-read)
The bundled toolset will be published as docker images with version numbers. Please note that the version is NOT the stix version, it is the stix-suite version.
Download the docker image with specific version
docker pull zhengxc1993/stix-suite:<version>
Run tools
docker run --rm zhengxc1993/stix-suite:<version> stix
-
stix[b3fedd9]
-
giggle[4071cb7]
-
excord[v0.2.4]
-
excord-lr[v0.1.17]
-
stix-merge[1.0.0]
cd stix-suite
docker build -t stix-suite:1.0.0 -f Dockerfile versions/1.0.1/
docker tag 784ea063777c zhengxc1993/stix-suite:1.0.1
docker push zhengxc1993/stix-suite:1.0.1
-
stix[b3fedd9]
-
giggle[4071cb7]
-
excord[v0.2.4]
-
excord-lr[v0.1.17]
stix-suite
docker build -t stix-suite:1.0.0 -f Dockerfile versions/1.0.0/
docker tag 784ea063777c zhengxc1993/stix-suite:1.0.0
docker push zhengxc1993/stix-suite:1.0.0
The following example is based on four sample BAMs from the 1000 Genomes project:
samtools view -b s3://1000genomes/phase3/data/NA12812/alignment/NA12812.mapped.ILLUMINA.bwa.CEU.low_coverage.20130415.bam 13 14 > NA12812.13.14.bam
samtools view -b s3://1000genomes/phase3/data/HG00672/alignment/HG00672.mapped.ILLUMINA.bwa.CHS.low_coverage.20120522.bam 13 14 > HG00672.13.14.bam
samtools view -b s3://1000genomes/phase3/data/NA12878/alignment/NA12878.mapped.ILLUMINA.bwa.CEU.low_coverage.20121211.bam 13 14 > NA12878.13.14.bam
samtools view -b s3://1000genomes/phase3/data/HG00674/alignment/HG00674.mapped.ILLUMINA.bwa.CHS.low_coverage.20121211.bam 13 14 > HG00674.13.14.bam
To create an STIX index on short-reads dataset, use excord
to extract discordant paired-end reads and
split reads.
mkdir four_alt
wget -O excord https://github.com/brentp/excord/releases/download/v0.2.2/excord_linux64
chmod +x excord
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
for sample in NA12812.13.14 HG00672.13.14 NA12878.13.14 HG00674.13.14; do
samtools view -b $sample.bam \
| ./excord \
--discordantdistance 500 \
--fasta hs37d5.fa.gz \
/dev/stdin \
| LC_ALL=C sort --buffer-size 2G -k1,1 -k2,2n -k3,3n \
| bgzip -c > four_alt/$sample.bed.gz
done
To create an STIX index on long-reads dataset, use excord-lr
to extract
split reads.
mkdir four_alt
wget -O excord https://github.com/zhengxinchang/excord-lr/releases/download/v0.1.17/excord-lr
chmod +x excord-lr
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
for sample in NA12812.13.14 HG00672.13.14 NA12878.13.14 HG00674.13.14; do
./excord-lr -b $sample.bam -o $sample.bed
LC_ALL=C sort --buffer-size 2G -k1,1 -k2,2n -k3,3n $sample.bed \
> bgzip -c > four_alt/$sample.bed.gz
done
Now index with GIGGLE.
giggle index -i "four_alt/*gz" -o four_alt_b -s -f
STIX requires a PED file that needs at least sample name and discordant alignment file name, but can support an arbitrary number of columns. For example:
Sample Sex Populationzv Super_Population Alt_File
NA12812 1 CEU EUR NA12812.13.14.bed.gz
HG00672 2 CHS EAS HG00672.13.14.bed.gz
NA12878 2 CEU EUR NA12878.13.14.bed.gz
HG00674 1 CHS EAS HG00674.13.14.bed.gz
From a PED file create a database/.
stix -i four_alt_b -p test/data/four.ped -d four.ped.db -c 5
STIX can query either a single SV or annotate a VCF file. All queries require a slop parameter that accounts for the variability in the normal fragment size. In most cases use the mean fragments size plus 3 times the fragment size standard deviation.
For individual queries, STIX takes the SV type and the left and right positions of the SV, where left is (by convention) the lower chromosomal position. Since many SVs do not have base pair resolution, STIX takes a genomic range for each side.
stix -i four_alt_db -d four.ped.db -s 500 -t DEL -l 14:68603030-68603035 -r 14:68603738-68603743
Total 0:1 0:0 7:8:11 0:2:1:1
Giggle_File_Id Sample Sex Population Super_Population Alt_File Pairend Split
0 HG00672 2 CHS EAS HG00672.13.14.bed.gz 8 0
1 HG00674 1 CHS EAS HG00674.13.14.bed.gz 7 0
2 NA12812 1 CEU EUR NA12812.13.14.bed.gz 7 0
3 NA12878 2 CEU EUR NA12878.13.14.bed.gz 11 0
For a VCF, STIX uses the INFO/SVTYPE
field for the SV type and the
POS
, INFO/END
, INFO/CIPOS
, and INFO/CIEND
fields for the quarry
positions. STIX annotates each variant with four fields:
STIX_ZERO
gives the number of samples that have no evidence for the SVSTIX_ONE
gives the number of samples that have one alignments supporting the SVSTIX_QUANTS
has three values that are the quantiles for the per sample count of alignments supporting the SVSTIX_QUANT_DEPTHS
has four values that give the number of samples with SV an evidence count that is- between 2 and the quantile one value number of alignments
- between quantile one to quantile two
- quantile two to quantile three
- more than quantile three
stix -i four_alt_db -d four.ped.db -s 500 -f 1kg.four.13.14.vcf.gz
git clone https://github.com/ryanlayer/giggle.git
cd giggle
make
cd ..
wget http://www.sqlite.org/2017/sqlite-amalgamation-3170000.zip
unzip sqlite-amalgamation-3170000.zip
git clone https://github.com/ryanlayer/stix.git
cd stix
make
STIX_ZERO STIX_ONE STIX_QUANTS STIX_QUANT_DEPTHS
0 0 0,4,11 0,0,1,1
1KG low-coverge SV callset (Sudmant et al., Nature 2015) (VCF, Paper)
#CHROM POS REF ALT INFO NA12878 NA12890 NA12889
19 12694867 G <CN0> CIEND=0,1;CIPOS=0,1;END=12698924 1|1 1|1 1|1
GT GQ SQ GL DP RO AO QR QA RS AS ASC RP AP AB
NA12878 1/1 13 796.49 -82,-3,-2 40 9 30 9 30 0 0 0 9 30 0.77
NA12890 1/1 10 609.94 -63,-3,-2 31 7 23 7 23 0 0 0 7 23 0.77
NA12889 0/1 139 794.47 -80,-1,-15 68 33 34 32 33 20 0 0 12 33 0.51
#CHROM POS REF ALT INFO NA12878 NA12890 NA12889
19 12694907 N <DEL> CIPOS=-10,69;CIEND=-59,4;END=12698928 1/1 1/1 1/1
High coverage | Low coverage |
---|---|
STIX_ZERO STIX_ONE STIX_QUANTS STIX_QUANT_DEPTHS
3 0 0,0,0 0,0,0,0
1KG low-coverge SV callset (Sudmant et al., Nature 2015) (VCF, Paper)
#CHROM POS REF ALT INFO NA12878 NA12890 NA12889
5 1022803 C <CN0> CIEND=-500,1000;CIPOS=-1000,500;END=1025877 1|1 1|1 1|1
GT GQ SQ GL DP RO AO QR QA RS AS ASC RP AP AB
NA12878 0/0 200 0.00 -0,-18,-61 62 62 0 61 0 27 0 0 34 0 0
NA12890 0/0 200 0.00 -0,-23,-78 78 78 0 78 0 37 0 0 41 0 0
NA12889 0/0 200 0.00 -0,-33,-109 110 109 0 109 0 55 0 0 54 0 0
N/A
High coverage | Low coverage |
---|---|
STIX_ZERO STIX_ONE STIX_QUANTS STIX_QUANT_DEPTHS
0 0 4,8,10 0,1,1,1
1KG low-coverge SV callset (Sudmant et al., Nature 2015) (VCF, Paper)
#CHROM POS REF ALT INFO NA12878 NA12890 NA12889
4 113985874 . <DEL> END=113986369 0|0 0|0 0|0
GT GQ SQ GL DP RO AO QR QA RS AS ASC RP AP AB
NA12878 1/1 24 1484.47 -152,-6,-3 74 17 56 17 56 0 0 0 17 56 0.77
NA12890 0/1 105 446.67 -46,-1,-11 42 22 19 22 19 0 0 0 22 19 0.46
NA12889 0/1 21 1259.40 -129,-3,-6 71 21 49 21 49 0 0 0 21 49 0.7
#CHROM POS REF ALT INFO NA12878 NA12890 NA12889
4 113985951 . <DEL> CIPOS=-12,2;CIEND=-51,12;END=113986325 1/1 0/1 1/1
High coverage | Low coverage |
---|---|
1KG low-coverge SV callset (Sudmant et al., Nature 2015) (VCF, Paper)
#CHROM POS INFO NA12878 NA12890 NA12889
12 12544792 END=12546607;CIEND=-21,21;CIPOS=-21,21;AC=4129;AF=0.824481 1|1 1|1 1|1
GT SU PE SR GQ SQ GL DP RO AO QR QA RS AS ASC RP AP AB
NA12878 0/1 58 58 0 114 1478.17 -150,-2,-14 99 39 60 38 59 20 0 2 18 57 0.61
NA12890 1/1 54 54 0 25 1356.74 -139,-5,-3 67 15 51 15 51 0 0 0 15 51 0.77
NA12889 0/1 46 46 0 200 1073.26 -108,-1,-22 94 47 46 47 45 26 0 1 21 44 0.49
#CHROM POS INFO NA12878 NA12890 NA12889
12 12544868 END=12546613;CIPOS=-2,17;CIEND=-15,19; 0/1 1/1 0/1
High coverage | Low coverage |
---|---|
1KG low-coverge SV callset (Sudmant et al., Nature 2015) (VCF, Paper)
#CHROM POS INFO NA12878 NA12890 NA12889
12 47290470 END=47309756;CIEND=-55,55;CIPOS=-55,55;AC=248;AF=0.0495208 0|0 0|0 1|0
GT SU PE SR GQ SQ GL DP RO AO QR QA RS AS ASC RP AP AB
NA12878 0/0 0 0 0 200 0.00 -0,-39,-130 131 131 0 130 0 64 0 0 66 0 0
NA12890 0/0 0 0 0 200 0.00 -0,-35,-115 116 116 0 115 0 56 0 0 59 0 0
NA12889 0/1 79 56 23 146 578.17 -65,-7,-59 120 87 32 86 31 38 0 9 48 22 0.26
#CHROM POS INFO NA12878 NA12890 NA12889
12 47290448 END=47309758;CIPOS=-3,2;CIEND=-10,5; 0/0 0/0 0/1
High coverage | Low coverage |
---|---|
1KG low-coverge SV callset (Sudmant et al., Nature 2015) (VCF, Paper)
N/A
GT SU PE SR GQ SQ GL DP RO AO QR QA RS AS ASC RP AP AB
NA12878 0/1 0 0 0 4 53.14 -6,-0,-1 4 1 2 1 2 1 0 2 0 0 0.67
NA12890 0/0 0 0 0 200 0 -0,-28,-93 94 94 0 93 0 42 0 0 51 0 0
NA12889 0/1 68 44 24 143 383.23 -48,-9,-65 115 89 25 88 24 42 0 2 46 22 0.21
#CHROM POS INFO NA12878 NA12890 NA12889
2 89161083 END=89185670;CIPOS=0,0;CIEND=-9,0 0/1 0/0 0/1
High coverage | Low coverage |
---|---|