All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- `pandora` mapping has been improved by doing a better detection of conflicting clusters and selection [#344];
- Parameter `--min-gene-coverage-proportion` to `pandora` `map`, `compare` and `discover` subcommands [#351];
- Parameter `--no-gene-coverage-filtering` to `pandora` `map`, `compare` and `discover` subcommands [#352];
- Parameter `--partial-matching-lower-bound` to `pandora` `map`, `compare` and `discover` subcommands [#353];
This version is a major release that breaks backwards compatibility with previous versions of `pandora`.
It improves `pandora` runtime performance by 15x and RAM usage by 20x;
- The `pandora` index changed from a set of files in a directory structure to a single, compressible and indexable `zip` file (`pandora` indexes now have the suffix `.panidx.zip`). This is now the single file that is produced by the `pandora index` command and is required as an argument to all the other `pandora` commands. This index is self-contained in the sense that it encodes all the information and metadata about it (e.g. which PRGs were used to create it, window and kmer size, etc). This new index provides the infrastructure for the next features and simplifies working with large reference pangenome collections, with a few million PRGs. This new index breaks backwards compatibility with previous `pandora` versions. The structure of this zip archive is as follows:
  - `_prg_names`: the names of the PRGs used as input to create this index;
  - `_prg_max_path_lengths`: the length of the longest path through each PRG;
  - `_prg_lengths`: the length of the string representation of each PRG;
  - `_minhash`: the minimizer hash data structure;
  - `_metadata`: metadata about the index;
  - `*.gfa`: the several GFA files describing the minimizing kmer graph for each PRG;
  - `*.fa`: the string representation of each PRG;
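
Because the index is a plain zip archive, its layout can be inspected with standard tooling. The following is a minimal sketch in Python, not part of `pandora` itself: it builds a toy archive in memory that only mimics the entry names listed above, then groups the members by role. The per-PRG file names (`geneA.gfa`, `geneA.fa`) and the placeholder contents are assumptions made for illustration; the real on-disk encoding of each entry is not described here.

```python
import io
import zipfile


def summarize_index(archive):
    """Group the members of a pandora-style index archive by role."""
    summary = {"metadata_files": [], "gfa_files": [], "fasta_files": []}
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if name.endswith(".gfa"):
                summary["gfa_files"].append(name)
            elif name.endswith(".fa"):
                summary["fasta_files"].append(name)
            else:
                summary["metadata_files"].append(name)
    return summary


def build_toy_index():
    """Create an in-memory archive with the documented entry names.

    The entry contents are placeholders, not the real pandora encoding.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name in ("_prg_names", "_prg_max_path_lengths", "_prg_lengths",
                     "_minhash", "_metadata"):
            zf.writestr(name, "")
        zf.writestr("geneA.gfa", "H\tVN:Z:1.0\n")   # hypothetical PRG name
        zf.writestr("geneA.fa", ">geneA\nACGT\n")   # hypothetical PRG name
    buffer.seek(0)
    return buffer


toy_summary = summarize_index(build_toy_index())
print(toy_summary)
```

To look at a real index, `summarize_index` can be given the path to a `.panidx.zip` file instead of the in-memory buffer.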
- Minimum C++ standard upgraded from `C++11` to `C++14`;
- We now test whether the genotype confidence of a variant is greater than or equal to the threshold provided by `--gt-conf`. Previously we only tested if it was greater than;
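
The effect of the `--gt-conf` change is easiest to see on a value that sits exactly on the threshold. The sketch below is illustrative Python, not `pandora` code; the function names and the confidence values are hypothetical.

```python
def passes_gt_conf(genotype_confidence, threshold):
    """Current behaviour: a confidence equal to the threshold passes."""
    return genotype_confidence >= threshold


def passes_gt_conf_previous(genotype_confidence, threshold):
    """Previous behaviour: only strictly greater confidences passed."""
    return genotype_confidence > threshold


confidences = [0.5, 1.0, 2.5]  # hypothetical genotype confidences
threshold = 1.0                # hypothetical value passed to --gt-conf

kept_now = [c for c in confidences if passes_gt_conf(c, threshold)]
kept_before = [c for c in confidences if passes_gt_conf_previous(c, threshold)]
print(kept_now, kept_before)
```

Only the variant whose confidence equals the threshold is treated differently by the two comparisons.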
- Removed CLI parameters `-w`, `-k` and `--clean` from the following `pandora` subcommands: `compare`, `discover`, `map`, `seq2path`;
- Removed `merge_index` subcommand;
- Removed gene-DBG and noise-filtering modules;
- Fixed a major bug on finding the longest path through PRGs;
- Several refactorings to the `pandora` index implementation;
- Optimisation of the `pandora` index data structure;
- A memory-efficient way to load PRGs when indexing and mapping, where we don't need to load all PRGs at once to process them, but just load on demand (also known as lazy loading). This is particularly useful when working with very large PanRGs;
- Random multimapping of reads if they map equally well to several graphs, reducing mapping bias. Added parameter `--rng-seed` to `pandora map/compare/discover` commands to make multimapping deterministic, if required;
- A new parameter to deal with auto-updating error rate and kmer model (see `--dont-auto-update-params` parameter in `pandora map/compare/discover` commands);
- Three new parameters to control when a gene should be filtered out due to too low or too high coverage (see `--min-abs-gene-coverage`, `--min-rel-gene-coverage` and `--max-rel-gene-coverage` parameters in `pandora map/compare/discover` commands);
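
The idea behind random multimapping with a fixed seed can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the `pandora` implementation: the per-graph scores, the graph names and the helper functions are hypothetical, and the only point shown is that ties are broken at random while a fixed seed makes the outcome reproducible.

```python
import random


def choose_graph(scores, rng):
    """Pick the best-scoring graph, breaking ties uniformly at random."""
    best = max(scores.values())
    # Sort the tied names so the result depends only on the RNG state,
    # not on dictionary insertion order.
    tied = sorted(name for name, score in scores.items() if score == best)
    return rng.choice(tied)


def map_reads(read_scores, seed):
    """Assign each read to one graph using a seeded random generator."""
    rng = random.Random(seed)
    return [choose_graph(scores, rng) for scores in read_scores]


# Hypothetical per-read mapping scores against three graphs.
read_scores = [
    {"geneA": 10, "geneB": 10, "geneC": 3},  # tie between geneA and geneB
    {"geneA": 7, "geneB": 2},                # unambiguous
    {"geneB": 5, "geneC": 5},                # tie between geneB and geneC
]

first_run = map_reads(read_scores, seed=42)
second_run = map_reads(read_scores, seed=42)
print(first_run == second_run)
```

Always picking the first tied graph would systematically favour some graphs; random choice removes that bias, and the seed restores reproducibility when it is needed.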
- Denovo discovery is now done by repeatedly polishing the loci's maximum likelihood sequences using the regions of the reads that mapped to the loci through Racon;
- Pandora `discover` CLI heavily changed: parameters `-M,--mapped-reads`, `--clean-dbg`, `--discover-k`, `--max-ins`, `--covg-threshold`, `-l`, `-L`, `-d,--merge`, `-N`, `--min-dbg-dp` removed;
- Pandora `map`, `compare` and `discover` commands now produce SAM files;
- Parameter `-K`/`--debugging-files` to pandora `map`, `compare` and `discover` commands to create extra debugging files, which are able to describe completely the mapping process of `pandora`.
- The VCF INFO field `SVTYPE` has now been changed to `VC` [#249]
- More robust TSV file parsing. Empty line no longer required at end [#213]
- Handle ambiguous bases properly instead of skipping to next read once we reach one [#294]
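
A parser that does not require a trailing empty line can be sketched as below. This is an illustrative Python example, not the `pandora` parser: the two-column `sample<TAB>path` layout, the function name and the file names are assumptions made for the example.

```python
import io


def parse_read_index(handle):
    """Parse 'sample<TAB>path' rows.

    Blank lines are skipped and a missing final newline is tolerated.
    """
    entries = []
    for line_number, raw in enumerate(handle, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore empty lines anywhere in the file
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 tab-separated fields, "
                f"got {len(fields)}"
            )
        sample, path = (field.strip() for field in fields)
        entries.append((sample, path))
    return entries


# Same content, once without a final newline and once with extra blank lines.
without_newline = parse_read_index(
    io.StringIO("sample1\treads1.fq\nsample2\treads2.fq"))
with_blank_lines = parse_read_index(
    io.StringIO("sample1\treads1.fq\n\nsample2\treads2.fq\n\n"))
print(without_newline == with_blank_lines)
```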
- `pandora` is now installable through `conda`;
- A script to archive the `pandora` repository with git submodules;
- Improved the sample example so now we can assert that the output produced is the expected one;
- Changes to the build process that enable `pandora` to be compiled in the `conda` environment;
- Version bump from `0.9.0-rc2` to `0.9.0`.
- `pandora discover` now processes one sample at a time, but runs with several threads on the heavy tasks, i.e. when mapping reads, finding candidate regions, and finding denovo variants. The result is that it now takes a lot less RAM to run on multiple samples.
- `pandora discover` now receives read index files describing samples and reads, and discovers denovo sequences in these samples. To improve performance on discovering denovo sequences on several samples, `pandora discover` is now multithreaded, but the performance is still the same as the previous version, i.e. each sample is processed in a single-threaded way;
- `pandora discover` output changed to a proprietary format. See example for the new output;
- `pandora` can now communicate with a `make_prg` prototype that is able to update PRGs without needing to realign and remake the PRG. This provides major performance upgrades to running the full `pandora` pipeline with denovo discovery enabled, and there is no longer any need to use a `snakemake` pipeline (see this example for how to run the full pipeline);
- We now use musl libc instead of Holy Build Box to build a precompiled portable binary, removing the dependency on `OpenMP 4.0+` or `GCC 4.9+`, and `GLIBC`;
- We now provide a script to build a portable precompiled binary as another option to run `pandora` easily. The portable binary is now provided with the release;
- `pandora` can now provide a meaningful stack trace in case of errors, to facilitate debugging (need to pass flag `-DPRINT_STACKTRACE` to `CMake`). Due to this, we now add debug symbols (`-g` flag) to every `pandora` build type, but this does not impact performance. The precompiled binary has this enabled.
- We now use the Hunter package manager, removing the requirement of having `ZLIB` and `Boost` system-wide installations;
- `GATB` is now a git submodule instead of an external project downloaded and compiled during compilation time. This means that when git cloning `pandora`, `cgranges` and `GATB` are also downloaded/cloned, and when preparing the build (running `cmake`), `Hunter` downloads and installs `Boost`, `GTest` and `ZLIB`. Thus we still need an internet connection to prepare the build (running `cmake`) but not for compiling (running `make`).
- We now use a GATB fork that accepts a `ZLIB` custom installation;
- Refactored all thirdparty libraries (`cgranges`, `GATB`, `backward`, `CLI11`, `inthash`) into their own directory `thirdparty`.
- Refactored asserts into exceptions, and now `pandora` can be compiled correctly in the `Release` mode. The build process is thus able to create a more optimized binary, resulting in improved performance.
- Don't assume Nanopore reads are longer than loci [#265]
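
The reason the assert-to-exception refactoring matters for optimized builds is that assertions are typically compiled out there (in C++, `assert` becomes a no-op when `NDEBUG` is defined), so any check that must always run has to be expressed differently. The sketch below shows the same pattern in Python, where `assert` statements are likewise removed under `python -O`. It is an analogy rather than `pandora` code; the exception type and the `kmer_at` helper are hypothetical.

```python
class InvalidKmerRequest(ValueError):
    """Hypothetical error type used only for this illustration."""


def kmer_at(sequence, start, k):
    """Return the k-mer starting at `start`, validating the request.

    The checks are explicit exceptions, so they still run when
    assertions are disabled in an optimized build or run.
    """
    if k <= 0:
        raise InvalidKmerRequest(f"k must be positive, got {k}")
    if start < 0 or start + k > len(sequence):
        raise InvalidKmerRequest(
            f"k-mer [{start}, {start + k}) is outside a sequence "
            f"of length {len(sequence)}"
        )
    return sequence[start:start + k]


print(kmer_at("ACGTAC", 2, 3))

try:
    kmer_at("ACGTAC", 5, 3)
except InvalidKmerRequest as error:
    print(f"rejected: {error}")
```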
There are a significant number of changes to the project between version 0.6 and this release. Only major changes are listed here. Future releases from this point will have their changes meticulously documented here.
- `discover` subcommand for de novo variant discovery [#234]
- many more tests
- FASTA/Q files are now parsed with `klib` [#223]
- command-line interface is now overhauled with many breaking changes [#224]
- global genotyping has been made default [#220]
- Various improvements to VCF-related functions
- k-mer coverage underflow bug in `LocalPRG` [#183]