- Free software: GNU General Public License v3 (GPLv3)
Manuscript: 10.1093/gigascience/giab080
Hoogstrate Y; Komor MA; Böttcher R; van Riet J; van de Werken HJG;
van Lieshout S; Hoffmann R; van den Broek E; Bolijn AS; Dits N; Sie D;
van der Meer D; Pepers F; Bangma CH; van Leenders GJLH; Smid M;
French PJ; Martens JWM; van Workum W; van der Spek PJ; Janssen B;
Caldenhoven E; Rausch C; de Jong M; Stubbs AP; Meijer GA; Fijneman RJA;
Jenster GW:
Fusion transcripts and their genomic breakpoints in polyadenylated
and ribosomal RNA–minus RNA sequencing data.
GigaScience, Volume 10, Issue 12, December·2021, giab080,
https://doi.org/10.1093/gigascience/giab080$
Supporting Data: 10.5524/100939
Hoogstrate Y; Komor MA; Böttcher R; van Riet J; van de Werken HJG;
van Lieshout S; Hoffmann R; van den Broek E; Bolijn AS; Dits N; Sie D;
van der Meer D; Pepers F; Bangma CH; van Leenders GJLH; Smid M;
French PJ; Martens JWM; van Workum W; van der Spek PJ; Janssen B;
Caldenhoven E; Rausch C; de Jong M; Stubbs AP; Meijer GA; Fijneman RJA;
Jenster GW (2021):
Supporting data for "Fusion transcripts and their genomic breakpoints in
poly(A)+ and rRNA-minus RNA sequencing data" GigaScience Database.
http://dx.doi.org/10.5524/100939
Fusion genes are often driver events in several types of cancer, generally detected with DNA-sequencing (DNA-seq) and more recently with RNA-sequencing (RNA-seq). In classical RNA-seq experiments, mRNA is extracted by targeting their poly-A tails, or reverse transcribed using oligo-dT primers. Multiple studies have shown that fusion transcripts can be successfully detected in such data, but the location of the DNA break remains often undetected because the majority of the breaks are in introns and poly-A RNA-seq data lacks intronic sequencing reads. By reverse transcribing rRNA-depleted RNA using random hexamer priming, the RNA library will also include pre-mRNA, covering introns in subsequent alignments.
Here we propose a computational method for detecting intron-to-intron breaks, corresponding to DNA breaks, on top of exon-to-exon splice junctions of fusion genes, using RNA-seq data only. To detect genomic breakpoints, intronic and exonic data are kept separated in a graph and further analysed. The analyses of the graph has built-in solutions for sequencing fragments rather than entire transcripts, for sequencing only the fragments' ends and for multiple exon-to-exon boundaries on top of genomic breaks. Unlike most RNA-seq fusion gene detection software, it is not restricted to exons or gene annotations, allowing detection of novel splice junctions, fusions to non-gene regions and fusions of non-polyadenylated transcripts.
However, using virtual environments gives more control over dependencies and local installations may be more convenient. This can be achieved as follows:
git clone https://github.com/yhoogstrate/dr-disco.git
cd dr-disco
virtualenv -p python3 .venv
source .venv/bin/activate
python setup.py install
Dr. Disco makes use of python3. A typical system wide installation could be achieved as follows:
git clone https://github.com/yhoogstrate/dr-disco.git
cd dr-disco
pip install -r requirements.txt ;
python3 setup.py install --user
git clone https://github.com/yhoogstrate/dr-disco.git
cd dr-disco
sudo pip install -r requirements.txt ;
sudo python3 setup.py install --user
Dr. Disco is also available at BioConda but this does not automatically ship with the blacklist files. Also, because of the changes in the built-system of bioconda, this version may be out of sync with the git repo.
The method is designed to analyse a single sample at the time, but can easily be run using linux parallel to scale up.
A dr-disco
pipeline typically consists of the following steps:
- Run STAR in fusion mode to obtain the Chimeric sam file - By the time of writing we used STAR 2.4.2
- Fix the Chimeric sam file and prepare it for analysis
- Detect fusion events and merge junctions corresponding to similar fusion transcripts
- Classify fusion genes based on statistical parameters (and reference genome specific blacklist)
- Integrate integrates results from the same genomic event and annotates gene names
STAR does not seem to be able to appropriately determine Chimeric reads if the orientation of the mates is not default. Hence, before running STAR, make sure whether the strand orientation is FR
(_R1: forward, _R2: _reverse). If this is not the case, you can use fastx_reverse_complement (http://hannonlab.cshl.edu/fastx_toolkit/) to convert to the appropriate strands.
- For forward, reverse (default) you can proceed with the original _R1 and _R2
- For forward, forward you need to use the original _R1 and create and use the reverse complement of _R2.
- For reverse, forward you need to create and use the reverse complement of _R1 and o_R2.
- For reverse, reverse you need to create and use the reverse complement of _R1 and the original of _R2 (but I am not aware of such data...).
The illumina NextSeq is claimed to be responsible for measuring poly-G sequences (https://www.biostars.org/p/294612/) which may result in vast amounts of discordant reads aligned to poly-G regions in the genome. It is recommended to clean such datasets and erase poly-G suffixes, for instance with the tool fastp.
Running RNA-STAR with fusion settings produces a file: <...>.Chimeric.out.sam
. This file contains discordant reads (split and spanning). It is recommended to run STAR with corresponding settings:
STAR --genomeDir ${star_index_dir} \
--readFilesIn ${left_fq_filenames} ${right_fq_filenames} \
--outFileNamePrefix {$prefix} \
--outSAMtype BAM SortedByCoordinate \
--outSAMstrandField intronMotif \
--outFilterIntronMotifs None \
--alignIntronMax 200000 \
--alignMatesGapMax 200000 \
--alignSJDBoverhangMin 10 \
--alignEndsType Local \
--chimSegmentMin 12 \
--chimJunctionOverhangMin 12 \
--sjdbGTFfile {$gene_model_gtf} \
--sjdbOverhang 100 \
--quantMode GeneCounts \
--twopass1readsN -1 \
--twopassMode Basic
Due to the chimSegmentMin
and chimJunctionOverhangMin
settings, STAR shall produce the additional output file(s). This file may need to be converted to BAM format, e.g. by running samtools view -bhS Chimeric.out.sam > Chimeric.out.bam
. This generated ChimericSorted.bam
can then be used as input file for Dr. Disco. Note that samtools has changed its commandline interface over time and that the stated command might be slighly different for different versions of samtools.
The first step of Dr. Disco is fixing the BAM file. Altohugh fixing implies it is broken, the alignment provided by STAR is incompatible with the IGV split view and misses the 'SA:' sam flag. In dr-disco fix
this is solved and allows a user to view discordant reads in IGV in more detail and by using the spit-view. For the split view it may be convenient to color the reads by subtype (split or spanning, and how the direction of the break is), as the mate and strand are related and discordant mates are usually shifted a bit with respect to the breakpoint. In the fixing steps such annotations are added as read groups. Also, this step ensures proper indexing and sorting. If you have as input file <...>.Chimeric.out.bam
, you can generate the fixed bam file with Dr. Disco as follows:
Usage: dr-disco fix [OPTIONS] INPUT_ALIGNMENT_FILE OUTPUT_ALIGNMENT_FILE
Options:
-t, --temp-dir PATH Path in which temp files are stored (default: /tmp)
--help Show this message and exit.
Here the INPUT_ALIGNMENT_FILE will be <...>.Chimeric.out.bam
and OUTPUT_ALIGNMENT_FILE will be <...>.Chimeric.fixed.bam
.
Hence, you can generate the fixed bam-file with:
dr-disco fix '<...>.Chimeric.out.bam' 'sample.fixed.bam'
To estimate breakpoints using Dr. Disco, you can proceed with:
Usage: dr-disco detect [OPTIONS] BAM_INPUT_FILE OUTPUT_FILE
Options:
-m, --min-e-score INTEGER Minimal score to initiate pulling sub-graphs
(larger numbers boost performance but result in
suboptimal results) [default=8]
Here the -m
argument controls the level of merging sub-graphs. If datasets become very large there shall be many subgraphs. However, because this is such a time consuming process, it can be desired to skip some of them. Rule of thumb: for every 10.000.000 reads, increase this value with 1. So for 10.000.000-20.000.000 mate pairs choose 1. So for a dataset of 75.000.000 mate pairs, proceed with:
dr-disco detect -m 7 'sample.fixed.bam' dr-disco.sample-name.all.txt
The results from dr-disco detect
contain many false positives. These may be derived from many different types of issues, like homolpolymeric sequences in the reference genome, rRNA, multimaps and optimal chimeric alignments just by chance. Most of these issues are recognisable by patterns imprinted in some variables in the output table. Then dr-disco classify
filters the results and you can proceed with:
Usage: dr-disco classify [OPTIONS] TABLE_INPUT_FILE TABLE_OUTPUT_FILE
Options:
--only-valid Only return results marked as 'valid'
--blacklist-regions TEXT Blacklist these regions (BED file)
--blacklist-junctions TEXT Blacklist these region-to-region junctions
(custom format, see files in ./share/)
--min-chim-overhang INTEGER Minimum alignment length on each side of the
junction. May need to be set to smaller values
for read lengths smaller than 75bp. Larger
values are more stringent. [default=50]
--ffpe Lowers the threshold for the relative amount of
mismatches, as often found in FFPE material.
Note that enabling this option will
consequently result in more false positives.
--help Show this message and exit.
After analysing our results after classification there were some positives that were systemetically recurrent, but also visual inspection suggested them to be true. These are often new exons or new genes (and marked as discordant probably due to non canonical splice junctions) or mistakes in the reference genome. Hence, as they are in terms of data exactly identical to fusion genes, these can not be filtered out. Therefore we added the option to blacklist certain regions or junctions. There is an important difference between them as a blacklist region prohibits any fusion to be within a certain region while a blacklist junction prohibits a fusion where the one break is in one region and the other break in the other. We have added quite some of these regions to the blacklist. In particular the blacklist for hg38 is rather complete. The blacklist files for hg19 and hh38 can be found here: ./share/.
It may be inconvenient to include the results that are classified as false. By using --only-valid
the results are restricted to only those classified positive. Hence, a typical way of using dr-disco classify
for a hg38 reference genome would be:
dr-disco classify \
--only-valid \
--blacklist-regions "$dr_disco_dir/share/blacklist-regions.hg38.bed" \
--blacklist-junctions "$dr_disco_dir/blacklist-junctions.hg38.txt" \
dr-disco.sample-name.all.txt \
dr-disco.sample-name.filtered.txt
The results of Dr. Disco classify can be integrated into fusion genes by running dr-disco integrate
. The corresponding GTF file should contain gene annotations that correspond to the used reference genome. The reference FASTA is used to determine splice junction motifs and their edit distance. The file should have a *.fai index file. I am not sure if the pyfaidx library will auto-generate one if it does not exist.
Usage: dr-disco integrate [OPTIONS] TABLE_INPUT_FILE TABLE_OUTPUT_FILE
Options:
--gtf TEXT Use gene annotation for estimating fusion genes and improve
classification of exonic (GTF file)
--fasta TEXT Use reference sequences to estimate edit distances to splice
junction motifs (FASTA file)
--help Show this message and exit.
Hence, we could proceed with:
dr-disco integrate \
--gtf ENSEMBL 'GRCh38_latest_genomic.gff' \
dr-disco.sample-name.filtered.txt \
dr-disco.sample-name.integrated.txt
Usage: dr-disco bam-extract [OPTIONS] REGION1 REGION2 BAM_INPUT_FILE
BAM_OUTPUT_FILE
Options:
--restrict-to-targeted-chromosomes
Excludes reads of which a piece was aligned
to other chromosomes than requested by the
regions.
--help Show this message and exit.
This tool makes a regional subset of the bam-file. It considered read by their identifier, therefore avoiding the issue of missing mate pairs that were aligned outside the region. It can be provided two regions, which is ideal for targetting both sides of a fusion junction.