Scaffolder Evaluation

Wrapper scripts to run genome assembly scaffolding tools and scripts to analyse the output for accuracy.

These were used in the paper ["A comprehensive evaluation of assembly scaffolding tools"] review paper, Genome Biology 2014, 15:R42. DOI: 10.1186/gb-2014-15-3-r42. If you use any of the scripts in your own work, please cite this paper. For details on the methods, please see the manuscript. The methods are not explained in this readme.

Wrapper scripts

The Wrapper-scripts/ directory contains a script for each of the tools listed below. Note that none of these tools are included in the code. Each one must be installed separately and the scripts assume that the relevant programs are in your $PATH.

Each script can take up to two sets of read pairs, and assumes that the orientation of these reads is to be "innies" (i.e. like a paired end library). If you have a mate pair library ("outties"), then you will need to reverse complement your reads before running one of the scaffolders.

The scaffolding tools with links to their publications and where to download the code are as follows.

ABySS [code] ABySS code, [paper] ABySS paper
Bambus2 (part of the AMOS package) [code] Bambus2 code, [paper] Bambus2 paper
GRASS [code] GRASS code, [paper] GRASS paper
MIP [code] MIP code, [paper] MIP paper
Opera [code] Opera code, [paper] Opera paper
SCARPA [code] SCARPA code, [paper] SCARPA paper
SGA [code] SGA code, [paper] SGA paper
SOAPdenovo2 [code] SOAPdenovo2 code, [paper] SOAPdenovo2 paper
SOPRA [code] SOPRA code, [paper] SOPRA paper
SSPACE [code] SSPACE code, [paper] SSPACE paper

Run each script without any options to get the usage instructions.

Analysis scripts

The Analysis-scripts/ directory contains scripts to analyse the accuracy of a genome scaffolder.

Installation

Prerequisites:

[Python3] python
The Python package Fastaq Fastaq
The Python package Pysam Pysam
MUMmer MUMmer
Bowtie2 Bowtie2
SAMtools SAMtools
Graphviz Graphviz (optional, only needed for the test cases, not for the real data)
R R
blastall, from NCBI's legacy BLAST executables (optional, this is only used to generate the test cases)

Take a copy of the scripts in the Analysis-scripts/ directory and make sure they are all in your $PATH to run the analysis.

Protocol to analyse a scaffolder's output

The starting point is the following data:

assembly_contigs.fa - a FASTA file of the contigs that are to be scaffolded
reference.fasta - a FASTA file of the reference genome
reads_1.fastq, reads_2.fastq - FASTQ files of the reads that will be used to scaffold the contigs. Note that the reads are not used to analyse the resulting scaffolds. They are only used for the actual scaffolding.

We need to make artificial contigs from the assembly contigs, with sequence tags that allow tracking of the contigs before and after scaffolding. You have two options:

download a tarball of pre-computed data that were used for the paper from

ftp://ftp.sanger.ac.uk/pub/pathogens/mh12/Scaffolder_evaluation/Scaffolder_evaluation_data.tar.gz

run the following three commands to make your own data

To make your data:

Make artificial contigs from the original contigs.

 scaff_test_ctg_to_perfect_ctg.py assembly_contigs.fa reference.fa artificial_contigs

Make sequence tags and only keep the artificial contigs that could be successfully tagged.

 scaff_test_make_unique_tags.py artificial_contigs.fa reference.fa artificial_contigs.tag

We will need to index the reference for later.
```
 samtools faidx reference.fa
```

Run your scaffolding tool of choice on the tagged artificial contigs artificial_contigs.tag.uniquely-tagged.contigs.fa, to make a file of scaffolds called scaffolds.fa. Then check the output of the scaffolder:

scaff_test_check_using_tags.py <insert size> artificial_contigs.tag scaffolds.fa reference.fa.fai check_scaffolds

where <insert size> should be set to the insert size of your paired reads. This makes numerous files called check_scaffolds.*. The important one is

check_scaffolds.log

which has the counts of tag types at its end. The tag types are bitwise flags (in a similar style to the samtools flag). The meanings are:

0 -- correct pair of tags.

1 -- tags originate from same reference sequence, but their orientation in the scaffolds is incorrect.

2 -- tags originate from different reference sequences.

4 -- tags originate from the same reference sequence but are the wrong distance apart.

8 -- tags originate from the same reference sequence but are not in the correct order.

16 -- this was used for debugging and can be ignored.

For example, a flag of 5 means that 4+1 happened, i.e. a pair of tags that originated from the same reference sequence, but their orientation and order were incorrect. Similarly, 12=8+4 means that two tags were from the same reference sequence, but were the wrong distance apart and in the wrong order. Finally, "lost" means that the tag was not found in the output of the scaffolder and "skipped" means a pair of correct tags, but they skipped over another tag.

Test cases

Make test case contigs and reads for scaffolding using

scaff_test_make_test_cases.sh <seed>

where <seed> is one of 1,2,...,5 to make one of the 5 repetitions of the data. This will make all the files for the 11 test cases in the current directory.

From the same directory that scaff_test_make_test_cases.sh was run, you can run all of the supported scaffolders on test case n using

scaff_test_run_test_case.sh output_directory n

Or run all of them (if you are using bash) with

for n in {1..11}; do scaff_test_run_test_case.sh test.$n.OUT $n; done

If you want to run a new scaffolder on a test case, the important files for test case n are:

test.n.ref.fa - the 'reference' genome
test.n.contigs.fa.tag.uniquely-tagged.contigs.fa - the tagged contigs for scaffolding
test.n.reads_1.fq, test.n.reads_2.fq - the forward and reverse reads to use for scaffolding. Alternatively, the same reads are also in interleaved format in the file test.n.reads.fq.

Run the scaffolder to make a file of scaffolds called scaffolds.fa. The minimum, average and maximum insert sizes are 350, 500 and 650. Use 0.4 for the 'standard deviation' of insert size with SSPACE. Check the accuracy of the scaffolds with the following.

scaff_test_check_using_tags.py 500 test.n.contigs.fa.tag scaffolds.fa test.n.ref.fa.fai check_scaffolds
scaff_test_check_using_tags_get_contig_layout.py check_scaffolds check_scaffolds.graph

The second script made a graph called check_scaffolds.graph.pdf using Graphviz, so you can easily see what the scaffolder did.

Note that if you use SGA on these test cases, it will break unless you specify -b N when running sga-astat.py, where N is the number of contigs to be scaffolded. In the wrapper script, this can be set using -astat_ops "-b N".

Simulated data

Generate simulated contigs from the S. aureus genome from the GAGE dataset, excluding its plasmid sequences:

wget http://gage.cbcb.umd.edu/data/Staphylococcus_aureus/Data.original/genome.fasta
samtools faidx genome.fasta
samtools faidx genome.fasta NC_010079 > genome_no_plasmid.fa

Then generate 3kb and 10kb contigs from this sequence:

scaff_test_make_simulated_contigs_from_genome.py genome_no_plasmid.fa 300 10000 genome_no_plasmid.fake_contigs.10kb_300gap.fa
scaff_test_make_simulated_contigs_from_genome.py genome_no_plasmid.fa 50 3000 genome_no_plasmid.fake_contigs.3kb_50gap.fa

Generate reads with scripts from the Fastaq package:

fastaq_to_perfect_reads genome_no_plasmid.fa fake_reads.3kb.fq 3000 200 20 76
fastaq_to_perfect_reads genome_no_plasmid.fa fake_reads.500bp.fq 500 30 20 76
fastaq_deinterleave fake_reads.3kb.fq fake_reads.3kb_1.fq fake_reads.3kb_2.fq
fastaq_deinterleave fake_reads.500bp.fq fake_reads.500bp_1.fq fake_reads.500bp_2.fq

We now have our reference sequence, contigs and paired reads. The analysis proceeds as above in the "Protocol to analyse scaffolds" section.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
Analysis-scripts		Analysis-scripts
Wrapper-scripts		Wrapper-scripts
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Scaffolder Evaluation

Wrapper scripts

Analysis scripts

Installation

Protocol to analyse a scaffolder's output

Test cases

Simulated data

About

Releases

Packages

Languages

License

martinghunt/Scaffolder-evaluation

Folders and files

Latest commit

History

Repository files navigation

Scaffolder Evaluation

Wrapper scripts

Analysis scripts

Installation

Protocol to analyse a scaffolder's output

Test cases

Simulated data

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages