CICERO (Clipped-reads Extended for RNA Optimization) is an assembly-based algorithm to detect diverse classes
of driver gene fusions from RNA-seq.
Explore the docs »
Work with demo data »
Read the paper »
Request Feature
Report Bug
⭐ Consider starring the repo! ⭐
To discover driver fusions beyond canonical exon-to-exon chimeric transcripts, we develop CICERO, a local assembly-based algorithm that integrates RNA-seq read support with extensive annotation for candidate ranking. CICERO outperforms commonly used methods, achieving a 95% detection rate for 184 independently validated driver fusions including internal tandem duplications and other non-canonical events in 170 pediatric cancer transcriptomes.
- Running CICERO
- Dependencies
- Running with Docker
- Running with St. Jude Cloud
- Generate Junctions
- Reference Files
- Supported Genomes
- Demo
- Output Fields
- Citation
- License
Add the src/scripts
directory to your system PATH
variable. Add the src/perllib
and dependencies/lib/perl
directories to your system PERL5LIB
variable.
Then invoke the CICERO wrapper as
Cicero.sh [-h] [-n ncores] -b bamfile -g genome -r refdir [-j junctions] [-o outdir] [-t threshold] [-s sc_cutoff] [-c sc_shift] [-p] [-d]
-p - optimize CICERO, sets sc_cutoff=3 and sc_shift=10 [default true]
-s <num> - minimum number of soft clip support required [default=2]
-t <num> - threshold for enabling increased soft clip cutoff [default=200000]
-c <num> - clustering distance for grouping similar sites [default=3]
-j <file> - junctions file from RNApeg
-n <num> - number of cores to utilize with GNU parallel
-d - disable excluded regions file use
ncores
is the number of cores to be run on (with GNU parallel).bamfile
is the input bamfile mapped to human genome builds GRCh37-lite or GRCh38_no_alt. Contact us if your bam is based on other reference version.genome
is eitherGRCh37-lite
orGRCh38_no_alt
. CICERO only support the two human reference genome versions.refdir
is the reference file directory specific to CICERO. Download Reference Files below. e.g.-r /home/user/software/CICERO/reference_hg38/
or-r /home/user/software/CICERO/reference_hg19/
outdir
is user defined output file folder.junctions
is the junctions file output from RNApeg. See Generate Junctions below. CICERO can detect fusion by analysis of splice junction reads. If this option is omitted, fusions generated by small deletions may be missed as these events may lack the soft-clipped reads.threshold
CICERO first detects all soft-clipped positions supported by >=2 reads from bam file. For sample with <=threshold (default 200,000) soft-clipped positions, CICERO will detect fusions based on these soft-clipped positons; otherwise, to speed-up CICERO running, CICERO will detect fusions based on soft-clipped positions supported by >=sc_cutoff (3, default for optimize mode, see below) reads. For sample with lots of soft-clipped positions, a smaller threshold will speed-up CICERO running, however, some fusion events (i.e. only supported by 2 reads) may be missed.sc_cutoff
controls the number of soft clip reads required to support a putative site. The default is 2, but for samples with large numbers of soft clip reads, it may be desirable to require additional support to reduce the computational time required.sc_shift
sets the threshold for considering events to be the same site.optimize
defaults to ON. This setssc_cutoff
to 3 for samples where the number of soft clip sites exceeds 200,000. It also setssc_shift
to 10 which sets the distance to consider events the same.-no-optimize
turns optimizations off. This can increase sensitivity, but increases the computational requirements.
The final CICERO fusion result file will be located at <outdir>/CICERO_DATADIR/<sample name>/final_fusions.txt
. Use the following guide to interpret the results.
To visualize CICERO fusion output you can load the final fusion output file at https://proteinpaint.stjude.org/FusionEditor/.
- GNU parallel
- Samtools 1.3.1
- Cap3
- Blat
- Java 1.8.0
- Perl 5.10.1 with libraries:
- base
- Bio
- Carp
- Compress
- Cwd
- Data
- DBI
- diagnostics
- Digest
- English
- enum
- Exporter
- File
- FileHandle
- List
- POSIX
- strict
- Sys
- Tree
- warnings
CICERO can be run with Docker. Pre-built Docker images are provided for each release in GitHub Packages.
Invoke the CICERO wrapper using the Docker image available in GitHub Packages. You will likely need to add an additional bind mount for the output and input (BAM + junctions) files. Note the following command pulls the latest
tag for the Docker image. For reproducible results, it is advisable to specify the exact version to run.
docker run -v <path to reference directory>:/reference ghcr.io/stjude/cicero:latest [-n cores] -b <bam file path> -g <genome> -r /reference -o <output directory> [-j junctions file] [-p] [-s int] [-t int] [-c int]
See Running CICERO for details of the parameters.
CICERO is integrated in the St. Jude Cloud Rapid RNA-Seq workflow. To run CICERO in St. Jude Cloud, access the tool through the platform page. Documentation for running and interpreting results is available in the user guide.
RNApeg is required to generate a junctions file for use by CICERO. You can get RNApeg from both Docker and Singularity. Once RNApeg is complete, the *.junctions.tab.shifted.tab
file can be provided to CICERO using the -j
argument.
RNApeg is authored by Michael N. Edmonson (@mnedmonson).
This software analyzes nextgen RNA sequencing data which has been mapped to whole-genome coordinates, identifying evidence of both known and novel splicing events from the resulting alignments. The raw junction sites in the mapped BAMs undergo postprocessing to correct various issues related to mapping ambiguity. The result is a more compact and consistent set of junction calls, simplifying downstream quantification, analysis, and comparison.
First, the BAM read mappings are analyzed to identify putative junction sites. This produces a list of junction sites along with counts of supporting reads and several associated quality metrics. While reflective of the BAM data, this output typically requires refinement by the following steps.
Novel junctions are compared with reference exon junction boundaries and evaluated for mapping ambiguity which can justify adjusting the sites to match. Even small ambiguities such as the presence of the same nucleotide on either side of a junction can be enough to nudge a prediction that would otherwise perfectly match a reference isoform out of place.
Mapping ambiguity is next evaluated within the novel junctions themselves. Ambiguous junctions are combined where possible, merging their counts of supporting reads and related annotations. This reduces the population of novel junctions while simultaneously improving the evidence for those remaining. Combining evidence for poorly-covered sites also improves the chances of these sites passing the default minimum level of 3 supporting reads required for reporting junctions in the final output.
Additional correction of novel junctions is also performed to identify previously unknown skips of an exon (or exons) within known reference isoforms. Special handling is required in these cases because while the corrected boundaries are known, the events themselves are novel.
Junctions may also be shifted in cases of ambiguity involving a single edge (i.e. junction start or end). While not doubly-anchorable as with known reference junctions or novel skips of known exons, this adjustment can standardize evidence e.g. for novel exons.
Novel junctions are subjected to additional scrutiny before being reported:
- must be supported by a minimum of 3 reads
- at least one read must pass minimum flanking sequence requirements, to avoid false positives near read ends due to insufficient anchoring
- the junction must be either observed bidirectionally, or be supported by very clean alignments (either perfect or with very few high-quality mismatches, insertions, deletions, or soft clips)
While these requirements are minimal, they substantially reduce background noise.
This step pools results for a set of samples and does additional standardization of novel exons based on the combined set. Mostly this has the effect of standardizing ambiguous novel junction sites across samples, but it can occasionally result in combinations of sites as well.
The primary output files are tab-delimited text.
Output is also written in UCSC .bed format, which can be used to visualize the junctions and supporting read counts within the UCSC genome browser.
docker run -v <outdir>:/results ghcr.io/stjude/rnapeg:latest -b bamfile -f fasta -r refflat
fasta
reference genome; i.e. "Homo_sapiens/GRCh38_no_alt/FASTA/GRCh38_no_alt.fa" or "Homo_sapiens/GRCh37-lite/FASTA/GRCh37-lite.fa" from Reference Files.refflat
i.e. "Homo_sapiens/GRCh38_no_alt/mRNA/RefSeq/refFlat.txt" or "Homo_sapiens/GRCh37-lite/mRNA/Combined/all_refFlats.txt" from Reference Files.
singularity run --containall --bind <outdir>:/results docker://ghcr.io/stjude/rnapeg:latest -b bamfile -f fasta -r refflat
You will also need to add --bind
arguments to mount the file paths for bamfile
, fasta
, and refflat
into the container.
Reference files are required to run CICERO. They can be found at the following location:
- GRCh37-lite: https://doi.org/10.5281/zenodo.3817656
- GRCh38_no_alt: https://doi.org/10.5281/zenodo.3894739
CICERO currently supports GRCh37-lite
and GRCh38_no_alt
.
A demo of CICERO can be found at the following location:
Field | Description |
---|---|
sample | Sample ID |
geneA / geneB | gene at breakpoint A / B |
chrA / chrB | chromosome at breakpoint A / B |
posA / posB | coordinate at breakpoint A / B |
ortA / ortB | Mapping strand of assembled contig at breakpoint A / B |
featureA / featureB | 5utr / 3utr / coding / intron / intergenic at breakpoint A / B |
sv_ort | Whether the mapping orientation of assembled contig has confident biological meaning; if confident, then '>', else '?' (e.g. the contig mapping is from sense strand of gene A to antisense strand of gene B). |
readsA / readsB | number of junction reads that support the fusion at breakpoint A / B |
matchA / matchB | contig matched length at breakpoint A / B region |
repeatA / repeatB | repeat score (0~1) at breakpoint A / B region, the higher the more repetitive |
coverageA / coverageB | coverage of junction reads that support the fusion at breakpoint A / B (add the sequence length that can be mapped to the assembled contig for each junction read) |
ratioA / ratioB | MAF of soft-clipped reads at breakpoint A / B (calculate the MAF for plus mapped reads and minus mapped reads, respectively; use the maximum MAF). |
qposA / qposB | breakpoint position in the contig that belongs to A / B part |
total_readsA / total_readsB | total reads number at the breakpoint at breakpoint A / B |
contig | Assembled contig sequence that support the fusion |
type | CTX (interchromosomal translocation) / Internal_dup / ITX (inversion) / DEL (deletion) / INS (insertion) / read_through |
score | Fusion score, the higher the better |
rating | HQ (known fusions) / RT (read_through) / LQ (others) |
medal | Estimated pathogenicity assessment using St. Jude Medal Ceremony. Value: 0/1/2/3/4, the bigger the better |
functional effect | ITD (Internal_dup) / Fusion / upTSS / NLoss / CLoss / other |
frame | 0 (event is not in frame) / 1 (event is in-frame) / 2 (geneB portion contains canonical coding start site (i.e. the entire CDS for geneB)) / 3 (possible 5' UTR fusion in geneB) |
Tian, L., Li, Y., Edmonson, M.N. et al. CICERO: a versatile method for detecting complex and diverse driver fusions using cancer RNA sequencing data. Genome Biol 21, 126 (2020). https://doi.org/10.1186/s13059-020-02043-x
Copyright 2020 St. Jude Children's Research Hospital
Licensed under a modified version of the Apache License, Version 2.0 (the "License") for academic research use only; you may not use this file except in compliance with the License. To inquire about commercial use, please contact the St. Jude Office of Technology Licensing at scott.elmer@stjude.org.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.