Chromosome scaffolding of simple diploid genomes using ALLHiC

High-throughput chromosome conformation capture (Hi-C) technology has become an economical and robust tool to generate a chromosome-scale assembly. However, high-quality chromosome scaffoldings are limited by the short contigs and chimeric contigs, making the assembly quality unsatisfactory. Here we present a Hi-C scaffolding protocol based on ALLHiC, which integrates multiple functions to break chimeric contigs and generate chromosome-scale scaffolds. In addition, we describe a convenient way to curate the remaining mis-assemblies. This pipeline has been successfully applied on many genome projects, including our previously published banyan tree and oolong tea genomes.

Overview of the workflow

Installation

Running environment:
- The major workflow was constructed based on the Linux system.
- The juicebox is recommended to run on the Windows/MacOS.
Required software
- Miniconda
- bwa
- samtools (version 1.9)
- bedtools
- ALLHiC
- asmkit
- ParaFly
- 3D-DNA
- juicebox_scripts
- Perl 5
- Python 3
- pysam
- Matplotlib
- Pandas
- Numpy

Installation

Install the conda package manager

curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh

Install the bwa, samtools, bedtools, and Perl via conda.

conda install -c bioconda bwa samtools bedtools
conda install -c conda-forge Perl

Install the python package of matplotlib, NumPy, and pandas via conda.
```
 conda install matplotlib numpy pandas pysam
```

Install the ALLHiC, 3D-DNA, juicebox_scripts via Github. We recommend installing these packages into ~/software.
a. ALLHiC

git clone https://github.com/tangerzhang/ALLHiC.git
chmod -R 755 ALLHiC
echo “export PATH=$HOME/software/ALLHiC/bin:$HOME/software/ALLHiC/scripts:$PATH” >> ~/.bash_profile

b. 3D-DNA

git clone https://github.com/aidenlab/3d-dna.git

c. juicebox_scripts

git clone https://github.com/phasegenomics/juicebox_scripts.git

Install the asmkit and ParaFly.
a. asmkit

wget https://github.com/wangyibin/asmkit/releases/download/v0.0.1/asmkit
mv asmkit ~/bin

b. ParaFly * The executable file of ParaFly can be found in the directory of lib/

wget https://sourceforge.net/projects/parafly/files/parafly-r2013-01-21.tgz
tar xzvf parafly-r2013-01-21.tgz
cd parafly-r2013-01-21
./configure –prefix=`pwd`
make install
mv bin/ParaFly ~/bin

Input Data

This pipeline require de novo assembly and Hi-C data.

de novo assembly

Hi-C data And, a small testing data sets could download from google drive sharing.

Bio-protocol_2204520_data
├── draft.asm.fasta
├── Lib_R1.fastq.gz
├── Lib_R2.fastq.gz
└── ref.chr.fasta

Major steps

Step 1: Correction of the draft contigs [Optional]

Map Hi-C reads to the draft assembly. a. Prepare the genome alignment index of draft assembly.

bwa index draft.asm.fasta
samtools faidx draft.asm.fasta

b. Map the Hi-C reads into draft assembly, meanwhile retain primary alignments, and sort the alignments.

bwa mem -5SPM -t 10 draft.asm.fasta Lib_R1.fastq.gz Lib_R2.fastq.gz \
                        | samtools view -hF 256 - \
                    | samtools sort -@ 10 -o sorted.bam -T tmp.ali

Create the index files of the above-sorted alignment file using samtools and break the misjoined contigs by Hi-C signals.
```
samtools index -@ 10 sorted.bam
ALLHiC_corrector -m sorted.bam -r draft.asm.fasta -o seq.HiCcorrected.fasta -t 10
```

Step 2: Map Hi-C reads to corrected assembly and filter Hi-C signals.

Mapping.

bwa index seq.HiCcorrected.fasta 
bwa mem -5SPM -t 10 seq.HiCcorrected.fasta Lib_R1.fastq.gz Lib_R2.fastq.gz \
            | samtools view -hF 256 - \
            | samtools sort -@ 10 -o sample.bwa_mem.bam -T tmp.ali

Filter the alignments.

samtools view -bq 30 sample.bwa_mem.bam > sample.unique.bam 
PreprocessSAMs.pl sample.unique.bam seq.HiCcorrected.fasta HINDIII

Step 3: Partion

ALLHiC_partition -r seq.HiCcorrected.fasta -e HindIII -k 5 -b sample.unique.REduced.paired_only.bam

* -e enzyme_sites, -k number of groups

Step 4: Optimization

k=5
for i in {1..k}
do 
    echo “allhic optimize sample.unique.REduced.paired_only.counts_AAGCTT.${k}g${i}.txt sample.unique.REduced.paired_only.clm”; 
done > cmd.list

ParaFly -c cmd.list -CPU 4

* k: number of groups, which mean the number of chromosomes in your genome.

Step 5: Building

ALLHiC_build seq.HiCcorrected.fasta

Step 6: Curation

Create a hic format file which can import into the juicebox

asmkit bam2links sample.unique.REduced.paired_only.bam out.links
asmkit agp2assembly groups.agp groups.assembly
bash ~/software/3d-dna/visualize/run-assembly-visualizer.sh -p false groups.assembly out.links

* The 3d-dna needs more than 16 Gb memory to run.

Juicebox (https://github.com/aidenlab/Juicebox/wiki) is a graphical software that can be run in local machines (Windows or macOS), which can use to curate the mis-assemblies by Hi-C signals. After careful curation, a modified assembly can be exported. a. import the hic file: groups.hic

b. import assembly file is groups.assembly

c. manually curation Detail curation methods can learn from a demo video

d. export assembly group.review.assembly

Convert modified assembly into agp location file and create the final chromosome-scale assembly.

python ~/software/juicebox_scripts/juicebox_scripts/juicebox_assembly_converter.py -a groups.review.assembly -f seq.HiCcorrected.fasta -s

Step 7: Assessment of the final assembly

Statistics of the chromosome length and anchoring rate.
```
statAGP.pl groups.review.agp
```
Plot the heatmap of whole genome and per chromosome.
a. Get group length.
```
samtools faidx groups.review.fasta 
cut -f 1,2 groups.review.fasta.fai > len.txt
```
b. Only keep chromosomal level assembly for plotting.
```
grep Chr len.txt > chrn.list
```
c. Plot heatmap in 500-kb resolution and output in pdf format.
```
ALLHiC_plot sample.bwa_mem.bam groups.review.agp chrn.list 500k pdf
```

Expected results

The expected results of this workflow are groups.review.agp and groups.review.fasta.

License

It is a free and open source software, licensed under GPLv3.

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
cache		cache
graphs		graphs
input		input
lib		lib
output		output
workflow		workflow
.DS_Store		.DS_Store
.gitignore		.gitignore
README.md		README.md
template.Rproj		template.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Chromosome scaffolding of simple diploid genomes using ALLHiC

Overview of the workflow

Installation

Input Data

Major steps

Step 1: Correction of the draft contigs [Optional]

Step 2: Map Hi-C reads to corrected assembly and filter Hi-C signals.

Step 3: Partion

Step 4: Optimization

Step 5: Building

Step 6: Curation

Step 7: Assessment of the final assembly

Expected results

License

About

Releases

Packages

Contributors 3

Languages

Bio-protocol/chromosome_scaffolding_of_simple_diploid_genomes_using_ALLHiC

Folders and files

Latest commit

History

Repository files navigation

Chromosome scaffolding of simple diploid genomes using ALLHiC

Overview of the workflow

Installation

Input Data

Major steps

Step 1: Correction of the draft contigs [Optional]

Step 2: Map Hi-C reads to corrected assembly and filter Hi-C signals.

Step 3: Partion

Step 4: Optimization

Step 5: Building

Step 6: Curation

Step 7: Assessment of the final assembly

Expected results

License

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages