14 NCTC samples for testing PacBio assemblers. They are the 14 samples used in the Circlator paper.

Assemblies and files for analysis are all in this GitHub repository. Raw sequencing reads are in the ENA. The filtered subreads FASTQ and corrected reads FASTA files made when running HGAP are available from

The file sample_data.tsv lists accession IDs for the raw reads and the reference assembly of each sample, and some basic stats (assembly size, number of reads, etc.).
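The column layout of sample_data.tsv is not described here, so a quick way to inspect it is sketched below. The function names list_columns and lookup_field are made up for illustration, and lookup_field assumes (unverified) that the first column holds the sample name.

```shell
# Print each column name from the header row with its 1-based index.
list_columns() {
    awk -F'\t' 'NR==1 { for (i = 1; i <= NF; i++) print i ": " $i; exit }' "$1"
}

# Print the value in column number $3 for the row whose first column is $2.
# Assumption: the first column of the TSV is the sample name.
lookup_field() {
    awk -F'\t' -v sample="$2" -v col="$3" \
        'NR > 1 && $1 == sample { print $col; exit }' "$1"
}

# Example usage (column number 3 is a placeholder, check list_columns first):
#   list_columns sample_data.tsv
#   lookup_field sample_data.tsv NCTCxxxxx 3
```

Looking up by column number rather than by name avoids guessing header names that may not match the real file.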

Each directory NCTCxxxxx/ contains all the files relating to that sample. The FASTA files in each directory are:

  • ref.fa - the reference sequence
  • canu.1.{0,1}.fa - an assembly made with version 1.0 or 1.1 of canu
  • miniasm.0.2.fa - an assembly made with miniasm (preprint here); miniasm.0.2.quiver.fa is the result of running quiver on that assembly.
  • hgap.fa - an assembly made with HGAP (publication here).
  • sprai. - an assembly made with version of Sprai
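As a quick sanity check on the files listed above, the number of sequences in each FASTA file of a sample directory can be counted by counting header lines. This is a minimal sketch; count_contigs is a hypothetical helper name, not something provided by this repository.

```shell
# Print "<file name><TAB><number of sequences>" for every .fa file in a directory.
count_contigs() {
    for fasta in "$1"/*.fa; do
        # Skip the unexpanded glob if the directory has no .fa files.
        [ -e "$fasta" ] || continue
        # grep -c exits non-zero when the count is 0, so ignore its status.
        n=$(grep -c '^>' "$fasta" || true)
        printf '%s\t%s\n' "$(basename "$fasta")" "$n"
    done
}

# Example usage:
#   count_contigs NCTCxxxxx
```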

Canu assemblies

Made with canu versions 1.0 and 1.1. The filtered subreads were used as input with

-pacbio-raw filtered_subreads.fq

and the genome size was set to the length of the reference genome for each sample, using


where $length was taken from the file sample_data.tsv.

The only other options changed were cluster-specific:

maxThreads=8 maxMemory=16 useGrid=0
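For orientation, the options above could be combined into a single command line roughly as follows. This is a sketch, not the exact command that was run: the -p/-d output prefix and directory values are placeholders, and genomeSize= is canu's documented genome-size parameter, used here on the assumption that it is how the length was passed. The function only prints the command; it does not run canu.

```shell
# Print (do not run) a canu command line for one sample.
#   $1 = output prefix and directory name (placeholder chosen for illustration)
#   $2 = reference genome length in bp, as taken from sample_data.tsv
canu_cmd() {
    prefix=$1
    length=$2
    # genomeSize= is an assumption about how the length was supplied.
    echo "canu -p $prefix -d $prefix genomeSize=$length" \
         "maxThreads=8 maxMemory=16 useGrid=0" \
         "-pacbio-raw filtered_subreads.fq"
}

# Example with a made-up length:
canu_cmd NCTCxxxxx 5000000
```

Printing the command first makes it easy to check the per-sample length before committing cluster time to a run.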

HGAP assemblies

Details TBC...

miniasm assemblies

Made with version 0.2 (and minimap version 0.2) using the filtered subreads output during a run of HGAP. The three commands run were:

minimap -Sw5 -L100 -m0 -t4 $reads $reads | gzip -1 > miniasm.paf.gz
miniasm -f $reads miniasm.paf.gz > miniasm.gfa
awk '$1=="S" {print ">"$2"\n"$3} ' miniasm.gfa > miniasm.fa

where $reads is the FASTQ file of reads, and the final output FASTA file of contigs is called miniasm.0.2.fa.
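The awk step above keeps only the segment ("S") lines of the GFA file, writing the second field as the FASTA header and the third field as the sequence. The toy example below illustrates this on a made-up three-line GFA file; the sequence names and the wrapper function gfa_to_fasta are invented for the illustration.

```shell
# Same awk program as in the miniasm commands, wrapped in a function.
gfa_to_fasta() {
    awk '$1=="S" {print ">"$2"\n"$3}' "$1"
}

# Build a tiny, made-up GFA file: two segment lines and one other line.
toy_dir=$(mktemp -d)
printf 'S\tutg000001l\tACGTACGT\n'            >  "$toy_dir/toy.gfa"
printf 'a\tutg000001l\t0\tread1:1-8\t+\t8\n'  >> "$toy_dir/toy.gfa"
printf 'S\tutg000002l\tGGCCTTAA\n'            >> "$toy_dir/toy.gfa"

# Only the two "S" lines are converted; the "a" line is ignored.
gfa_to_fasta "$toy_dir/toy.gfa"
# Output:
# >utg000001l
# ACGTACGT
# >utg000002l
# GGCCTTAA
```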

Each miniasm assembly has had quiver run on it. The FASTA file is called miniasm.0.2.quiver.fa.

Sprai assemblies

Made with version of Sprai using this wrapper script with the options --threads 8 --memory 16. Sprai runs Celera. Version 8.3rc2 of Celera was used. For each sample, the genome length given to the wrapper script was taken from the file sample_data.tsv.

To do

  • Gather HGAP assembler version/options etc
  • Add PBcR assemblies
  • Run Quast on all assemblies/refs


