A small Python command-line tool that generates comparative plots for multiple rnaQUAST short reports.
rnaQUAST (https://github.com/ablab/rnaquast) is a great tool for the evaluation of transcriptome assemblies.
It generates a multitude of metrics for the quality of transcriptome assemblies, many of them by mapping the transcripts to an annotated genome.
rnaQUAST does a great job at rating individual assemblies; however, directly comparing different reports is not as easy.
This tool compares metrics from rnaQUAST's short reports.
positional arguments:
report_dirs paths to output directories from rnaQUAST
options:
-h, --help show this help message and exit
-names NAMES [NAMES ...]
list of names for the assemblies (default=["auto"])
-colors COLORS [COLORS ...]
list of colors in hexcode (default=["auto"])
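The help text above is consistent with a standard argparse interface. The following is only a minimal sketch of how such an interface could be declared, not the actual source of rnaQUASTcompare.py. The function name `build_parser`, the `nargs="+"` setting for `report_dirs`, and the directory and assembly names in the example call are assumptions made for illustration.

```python
import argparse


def build_parser():
    # Hypothetical reconstruction from the help text; the real script may differ.
    parser = argparse.ArgumentParser(prog="rnaQUASTcompare.py")
    # Assumption: one or more report directories are accepted.
    parser.add_argument("report_dirs", nargs="+",
                        help="paths to output directories from rnaQUAST")
    parser.add_argument("-names", nargs="+", default=["auto"],
                        help='list of names for the assemblies (default=["auto"])')
    parser.add_argument("-colors", nargs="+", default=["auto"],
                        help='list of colors in hexcode (default=["auto"])')
    return parser


# Example call with made-up directory names, assembly names and colors.
args = build_parser().parse_args(
    ["out_assembly_a", "out_assembly_b",
     "-names", "AssemblyA", "AssemblyB",
     "-colors", "#1f77b4", "#ff7f0e"]
)
print(args.report_dirs, args.names, args.colors)
```

When `-names` or `-colors` is omitted, the sketch falls back to `["auto"]`, matching the defaults shown in the help text.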
rnaQUASTcompare.py will generate an output folder named after the current date and time in the same directory.
I found that the metric "Avg. mismatches per transcript" favors assemblies with shorter transcripts,
so I replaced it with "Avg. mismatches per aligned kb".
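A small worked example may make the bias clearer. The numbers below are invented, and the two helper functions are hypothetical. They only illustrate the general idea of normalizing by aligned length and are not necessarily how rnaQUASTcompare.py derives the value from the short report.

```python
def mismatches_per_transcript(total_mismatches, n_transcripts):
    # Average number of mismatches per transcript.
    return total_mismatches / n_transcripts


def mismatches_per_aligned_kb(total_mismatches, aligned_bp):
    # Mismatches normalized to 1000 aligned bases.
    return 1000 * total_mismatches / aligned_bp


# Two made-up assemblies with the same error rate (1 mismatch per 1000 bp),
# but different average transcript lengths (500 bp vs. 2000 bp).
short_assembly = {"transcripts": 1000, "aligned_bp": 500_000, "mismatches": 500}
long_assembly = {"transcripts": 1000, "aligned_bp": 2_000_000, "mismatches": 2000}

for name, a in (("short", short_assembly), ("long", long_assembly)):
    print(
        name,
        mismatches_per_transcript(a["mismatches"], a["transcripts"]),
        mismatches_per_aligned_kb(a["mismatches"], a["aligned_bp"]),
    )
```

Per transcript, the assembly with shorter transcripts looks four times better (0.5 vs. 2.0 mismatches), although both have exactly 1.0 mismatch per aligned kb.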
The output includes dataframes combining the data of all short reports in .csv, .tsv and .tex format.
Metrics are grouped into four groups: "Gene metrics", "Transcript metrics", "Isoform metrics"
and "Other metrics". For each of them a bar and a line plot will be created.
Additionally, a combined bar plot and a combined line plot are created for all metrics together.
In the combined plots all values are scaled to [0,1]; the details of the scaling operations can be found below.
Combined plots for all metrics with scaled values and individual plots for each metrics group.
Example:
A comparison of three transcriptome assembly tools from the same RNA-Seq data.
Value scaling
I divided the metrics into groups:
- Gene metrics
"50%-assembled genes", "95%-assembled genes", "50%-covered genes", "95%-covered genes"
- Isoforms metrics
"50%-assembled isoforms", "95%-assembled isoforms", "50%-covered isoforms", "95%-covered isoforms"
- Transcript metrics
"Transcripts > 500 bp", "Transcripts > 1000 bp", "Aligned", "Uniquely aligned", "Multiply aligned", "Unaligned", "Misassemblies", "Unannotated", "50%-matched", "95%-matched"
- Scaled metrics
"Database coverage", "Avg. aligned fraction", "Mean fraction of transcript matched"
- Other metrics
"Transcripts", "Avg. mismatches per aligned kb", "Duplication ratio"
Gene metrics are divided by the number of genes in the genome annotation.
Isoforms metrics are divided by the number of isoforms in the genome annotation.
Transcript metrics are divided by the number of sequences in the respective assembly.
Scaled metrics are left unchanged.
Other metrics are divided by the maximum value for all assemblies.
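The scaling rules above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the tool's actual code: the function `scale_reports`, the nested-dict input format, and the use of the "Transcripts" metric as the number of sequences in an assembly are all assumptions, and the example values are made up.

```python
GENE = {"50%-assembled genes", "95%-assembled genes",
        "50%-covered genes", "95%-covered genes"}
ISOFORM = {"50%-assembled isoforms", "95%-assembled isoforms",
           "50%-covered isoforms", "95%-covered isoforms"}
TRANSCRIPT = {"Transcripts > 500 bp", "Transcripts > 1000 bp", "Aligned",
              "Uniquely aligned", "Multiply aligned", "Unaligned",
              "Misassemblies", "Unannotated", "50%-matched", "95%-matched"}
SCALED = {"Database coverage", "Avg. aligned fraction",
          "Mean fraction of transcript matched"}
OTHER = {"Transcripts", "Avg. mismatches per aligned kb", "Duplication ratio"}


def scale_reports(reports, n_genes, n_isoforms):
    """Scale metrics to [0,1]. reports maps assembly name -> {metric: value}."""
    # Maximum of each "Other" metric across all assemblies.
    maxima = {}
    for metrics in reports.values():
        for metric, value in metrics.items():
            if metric in OTHER:
                maxima[metric] = max(maxima.get(metric, 0), value)

    scaled = {}
    for name, metrics in reports.items():
        # Assumption: "Transcripts" is the number of sequences in the assembly.
        n_sequences = metrics["Transcripts"]
        out = {}
        for metric, value in metrics.items():
            if metric in GENE:
                out[metric] = value / n_genes
            elif metric in ISOFORM:
                out[metric] = value / n_isoforms
            elif metric in TRANSCRIPT:
                out[metric] = value / n_sequences
            elif metric in SCALED:
                out[metric] = value  # already a fraction, left unchanged
            elif metric in OTHER:
                out[metric] = value / maxima[metric] if maxima[metric] else 0.0
        scaled[name] = out
    return scaled


# Made-up values for two assemblies and a made-up annotation size.
reports = {
    "A": {"Transcripts": 1000, "50%-assembled genes": 200,
          "50%-assembled isoforms": 300, "Aligned": 900,
          "Database coverage": 0.4, "Duplication ratio": 1.2},
    "B": {"Transcripts": 2000, "50%-assembled genes": 250,
          "50%-assembled isoforms": 450, "Aligned": 1500,
          "Database coverage": 0.5, "Duplication ratio": 1.5},
}
scaled = scale_reports(reports, n_genes=1000, n_isoforms=1500)
```

Note that in this sketch the assembly with the largest value of an "Other" metric always ends up at 1.0 for that metric, so these values are only comparable within one run.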
To me it doesn't really matter if you cite this tool at all. If you think you have to or
want to make others aware of this tool you can refer directly to this repository.
I would also be very pleased if you could let me know if and how you use this tool.