Skip to content

Latest commit

 

History

History
195 lines (179 loc) · 8 KB

File metadata and controls

195 lines (179 loc) · 8 KB

License: GPL v3

Chromosome scaffolding of simple diploid genomes using ALLHiC

High-throughput chromosome conformation capture (Hi-C) technology has become an economical and robust tool to generate a chromosome-scale assembly. However, high-quality chromosome scaffoldings are limited by the short contigs and chimeric contigs, making the assembly quality unsatisfactory. Here we present a Hi-C scaffolding protocol based on ALLHiC, which integrates multiple functions to break chimeric contigs and generate chromosome-scale scaffolds. In addition, we describe a convenient way to curate the remaining mis-assemblies. This pipeline has been successfully applied on many genome projects, including our previously published banyan tree and oolong tea genomes.

Overview of the workflow

Installation

  • Running environment:
    • The major workflow was constructed based on the Linux system.
    • The juicebox is recommended to run on the Windows/MacOS.
  • Required software
  • Installation
    1. Install the conda package manager
      curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      sh Miniconda3-latest-Linux-x86_64.sh
    2. Install the bwa, samtools, bedtools, and Perl via conda.
      conda install -c bioconda bwa samtools bedtools
      conda install -c conda-forge Perl
    3. Install the python package of matplotlib, NumPy, and pandas via conda.
       conda install matplotlib numpy pandas pysam
    4. Install the ALLHiC, 3D-DNA, juicebox_scripts via Github. We recommend installing these packages into ~/software.
      a. ALLHiC
      git clone https://github.com/tangerzhang/ALLHiC.git
      chmod -R 755 ALLHiC
      echo “export PATH=$HOME/software/ALLHiC/bin:$HOME/software/ALLHiC/scripts:$PATH>> ~/.bash_profile
      b. 3D-DNA
      git clone https://github.com/aidenlab/3d-dna.git 
      c. juicebox_scripts
      git clone https://github.com/phasegenomics/juicebox_scripts.git
    5. Install the asmkit and ParaFly.
      a. asmkit
      wget https://github.com/wangyibin/asmkit/releases/download/v0.0.1/asmkit
      mv asmkit ~/bin
      b. ParaFly * The executable file of ParaFly can be found in the directory of lib/
      wget https://sourceforge.net/projects/parafly/files/parafly-r2013-01-21.tgz
      tar xzvf parafly-r2013-01-21.tgz
      cd parafly-r2013-01-21
      ./configure –prefix=`pwd`
      make install
      mv bin/ParaFly ~/bin

Input Data

This pipeline require de novo assembly and Hi-C data.

  • de novo assembly
  • Hi-C data And, a small testing data sets could download from google drive sharing.
    Bio-protocol_2204520_data
    ├── draft.asm.fasta
    ├── Lib_R1.fastq.gz
    ├── Lib_R2.fastq.gz
    └── ref.chr.fasta
    

Major steps

Step 1: Correction of the draft contigs [Optional]

  • Map Hi-C reads to the draft assembly. a. Prepare the genome alignment index of draft assembly.

    bwa index draft.asm.fasta
    samtools faidx draft.asm.fasta

    b. Map the Hi-C reads into draft assembly, meanwhile retain primary alignments, and sort the alignments.

    bwa mem -5SPM -t 10 draft.asm.fasta Lib_R1.fastq.gz Lib_R2.fastq.gz \
                            | samtools view -hF 256 - \
                        | samtools sort -@ 10 -o sorted.bam -T tmp.ali
  • Create the index files of the above-sorted alignment file using samtools and break the misjoined contigs by Hi-C signals.

    samtools index -@ 10 sorted.bam
    ALLHiC_corrector -m sorted.bam -r draft.asm.fasta -o seq.HiCcorrected.fasta -t 10

Step 2: Map Hi-C reads to corrected assembly and filter Hi-C signals.

  • Mapping.
    bwa index seq.HiCcorrected.fasta 
    bwa mem -5SPM -t 10 seq.HiCcorrected.fasta Lib_R1.fastq.gz Lib_R2.fastq.gz \
                | samtools view -hF 256 - \
                | samtools sort -@ 10 -o sample.bwa_mem.bam -T tmp.ali
  • Filter the alignments.
    samtools view -bq 30 sample.bwa_mem.bam > sample.unique.bam 
    PreprocessSAMs.pl sample.unique.bam seq.HiCcorrected.fasta HINDIII

Step 3: Partion

ALLHiC_partition -r seq.HiCcorrected.fasta -e HindIII -k 5 -b sample.unique.REduced.paired_only.bam

* -e enzyme_sites, -k number of groups

Step 4: Optimization

k=5
for i in {1..k}
do 
    echo “allhic optimize sample.unique.REduced.paired_only.counts_AAGCTT.${k}g${i}.txt sample.unique.REduced.paired_only.clm”; 
done > cmd.list

ParaFly -c cmd.list -CPU 4

* k: number of groups, which mean the number of chromosomes in your genome.

Step 5: Building

ALLHiC_build seq.HiCcorrected.fasta

Step 6: Curation

  • Create a hic format file which can import into the juicebox

    asmkit bam2links sample.unique.REduced.paired_only.bam out.links
    asmkit agp2assembly groups.agp groups.assembly
    bash ~/software/3d-dna/visualize/run-assembly-visualizer.sh -p false groups.assembly out.links

    * The 3d-dna needs more than 16 Gb memory to run.

  • Juicebox (https://github.com/aidenlab/Juicebox/wiki) is a graphical software that can be run in local machines (Windows or macOS), which can use to curate the mis-assemblies by Hi-C signals. After careful curation, a modified assembly can be exported. a. import the hic file: groups.hic

    b. import assembly file is groups.assembly

    c. manually curation Detail curation methods can learn from a demo video

    d. export assembly group.review.assembly

  • Convert modified assembly into agp location file and create the final chromosome-scale assembly.

    python ~/software/juicebox_scripts/juicebox_scripts/juicebox_assembly_converter.py -a groups.review.assembly -f seq.HiCcorrected.fasta -s

Step 7: Assessment of the final assembly

  • Statistics of the chromosome length and anchoring rate.
    statAGP.pl groups.review.agp
  • Plot the heatmap of whole genome and per chromosome.
    a. Get group length.
    samtools faidx groups.review.fasta 
    cut -f 1,2 groups.review.fasta.fai > len.txt
    b. Only keep chromosomal level assembly for plotting.
    grep Chr len.txt > chrn.list
    c. Plot heatmap in 500-kb resolution and output in pdf format.
    ALLHiC_plot sample.bwa_mem.bam groups.review.agp chrn.list 500k pdf

Expected results

  • The expected results of this workflow are groups.review.agp and groups.review.fasta.

License

It is a free and open source software, licensed under GPLv3.