High-throughput chromosome conformation capture (Hi-C) technology has become an economical and robust tool to generate a chromosome-scale assembly. However, high-quality chromosome scaffoldings are limited by the short contigs and chimeric contigs, making the assembly quality unsatisfactory. Here we present a Hi-C scaffolding protocol based on ALLHiC, which integrates multiple functions to break chimeric contigs and generate chromosome-scale scaffolds. In addition, we describe a convenient way to curate the remaining mis-assemblies. This pipeline has been successfully applied on many genome projects, including our previously published banyan tree and oolong tea genomes.
- Running environment:
- The major workflow was constructed based on the Linux system.
- The juicebox is recommended to run on the Windows/MacOS.
- Required software
- Installation
- Install the conda package manager
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh sh Miniconda3-latest-Linux-x86_64.sh
- Install the bwa, samtools, bedtools, and Perl via conda.
conda install -c bioconda bwa samtools bedtools conda install -c conda-forge Perl
- Install the python package of matplotlib, NumPy, and pandas via conda.
conda install matplotlib numpy pandas pysam
- Install the ALLHiC, 3D-DNA, juicebox_scripts via Github. We recommend installing these packages into
~/software
.
a. ALLHiCb. 3D-DNAgit clone https://github.com/tangerzhang/ALLHiC.git chmod -R 755 ALLHiC echo “export PATH=$HOME/software/ALLHiC/bin:$HOME/software/ALLHiC/scripts:$PATH” >> ~/.bash_profile
c. juicebox_scriptsgit clone https://github.com/aidenlab/3d-dna.git
git clone https://github.com/phasegenomics/juicebox_scripts.git
- Install the
asmkit
andParaFly
.
a. asmkitb. ParaFly * The executable file of ParaFly can be found in the directory of lib/wget https://github.com/wangyibin/asmkit/releases/download/v0.0.1/asmkit mv asmkit ~/bin
wget https://sourceforge.net/projects/parafly/files/parafly-r2013-01-21.tgz tar xzvf parafly-r2013-01-21.tgz cd parafly-r2013-01-21 ./configure –prefix=`pwd` make install mv bin/ParaFly ~/bin
- Install the conda package manager
This pipeline require de novo assembly and Hi-C data.
- de novo assembly
- Hi-C data
And, a small testing data sets could download from google drive sharing.
Bio-protocol_2204520_data ├── draft.asm.fasta ├── Lib_R1.fastq.gz ├── Lib_R2.fastq.gz └── ref.chr.fasta
-
Map Hi-C reads to the draft assembly. a. Prepare the genome alignment index of draft assembly.
bwa index draft.asm.fasta samtools faidx draft.asm.fasta
b. Map the Hi-C reads into draft assembly, meanwhile retain primary alignments, and sort the alignments.
bwa mem -5SPM -t 10 draft.asm.fasta Lib_R1.fastq.gz Lib_R2.fastq.gz \ | samtools view -hF 256 - \ | samtools sort -@ 10 -o sorted.bam -T tmp.ali
-
Create the index files of the above-sorted alignment file using samtools and break the misjoined contigs by Hi-C signals.
samtools index -@ 10 sorted.bam ALLHiC_corrector -m sorted.bam -r draft.asm.fasta -o seq.HiCcorrected.fasta -t 10
- Mapping.
bwa index seq.HiCcorrected.fasta bwa mem -5SPM -t 10 seq.HiCcorrected.fasta Lib_R1.fastq.gz Lib_R2.fastq.gz \ | samtools view -hF 256 - \ | samtools sort -@ 10 -o sample.bwa_mem.bam -T tmp.ali
- Filter the alignments.
samtools view -bq 30 sample.bwa_mem.bam > sample.unique.bam PreprocessSAMs.pl sample.unique.bam seq.HiCcorrected.fasta HINDIII
ALLHiC_partition -r seq.HiCcorrected.fasta -e HindIII -k 5 -b sample.unique.REduced.paired_only.bam
* -e
enzyme_sites, -k
number of groups
k=5
for i in {1..k}
do
echo “allhic optimize sample.unique.REduced.paired_only.counts_AAGCTT.${k}g${i}.txt sample.unique.REduced.paired_only.clm”;
done > cmd.list
ParaFly -c cmd.list -CPU 4
* k
: number of groups, which mean the number of chromosomes in your genome.
ALLHiC_build seq.HiCcorrected.fasta
-
Create a hic format file which can import into the juicebox
asmkit bam2links sample.unique.REduced.paired_only.bam out.links asmkit agp2assembly groups.agp groups.assembly bash ~/software/3d-dna/visualize/run-assembly-visualizer.sh -p false groups.assembly out.links
* The 3d-dna needs more than 16 Gb memory to run.
-
Juicebox (https://github.com/aidenlab/Juicebox/wiki) is a graphical software that can be run in local machines (Windows or macOS), which can use to curate the mis-assemblies by Hi-C signals. After careful curation, a modified assembly can be exported. a. import the hic file:
groups.hic
b. import assembly file is
groups.assembly
c. manually curation Detail curation methods can learn from a demo video
-
Convert modified assembly into agp location file and create the final chromosome-scale assembly.
python ~/software/juicebox_scripts/juicebox_scripts/juicebox_assembly_converter.py -a groups.review.assembly -f seq.HiCcorrected.fasta -s
- Statistics of the chromosome length and anchoring rate.
statAGP.pl groups.review.agp
- Plot the heatmap of whole genome and per chromosome.
a. Get group length.b. Only keep chromosomal level assembly for plotting.samtools faidx groups.review.fasta cut -f 1,2 groups.review.fasta.fai > len.txt
c. Plot heatmap in 500-kb resolution and output in pdf format.grep Chr len.txt > chrn.list
ALLHiC_plot sample.bwa_mem.bam groups.review.agp chrn.list 500k pdf
- The expected results of this workflow are
groups.review.agp
andgroups.review.fasta
.
It is a free and open source software, licensed under GPLv3.