This repository accompanies the eBook entitled “Bioinformatics Recipes for Plant Genomics: Data, Code, and Workflows” in Bio-101.
Hi-C is a chromosome conformation capture method that was originally developed to detect genome-wide chromatin interactions. It is now widely used to scaffold de novo assembled contigs into chromosome-scale genome sequences, and several open-source tools have been developed for this purpose. The input is a set of contigs assembled de novo from long-read or short-read sequencing data. Hi-C reads are mapped to these contigs, and the resulting interaction matrix is used to scaffold the contigs into chromosome-scale sequences. Tools differ in the algorithms they use to calculate the interaction matrix and to correct misassemblies and misjoins, and they may require different dependencies or running environments. Here, we describe a step-by-step protocol for genome scaffolding with Hi-C data using a comprehensive pipeline: compute the interaction matrix with Juicer, scaffold contigs with the 3D-DNA pipeline, then visualize and manually correct the scaffolding with Juicebox. It is the first detailed protocol that shows how to scaffold plant genomes from Hi-C data with this pipeline. Compared with many other pipelines, this protocol requires only the primary assembled contigs and raw Hi-C data as inputs. It is also compatible with multiple restriction enzymes and supports visualization and manual correction. As more and more genomes are sequenced with Hi-C, this step-by-step protocol should be widely applicable to scaffolding large eukaryotic genomes.
Trimmomatic
Juicer
3D-DNA pipeline
Juicebox (v. 1.11.08)
BWA
Samtools
Miniconda
BUSCO
Java 1.8 JDK
Tips: we recommend users use the latest version of each software listed above, except for Juicebox (v. 1.11.08).
- De novo assembled contigs in FASTA format. The contigs used in this study can be downloaded HERE
- Raw Hi-C sequencing data in FASTQ format. The FASTQ files used in this study can be downloaded HERE
More details can be found in the Input folder
a. Linux server or cluster
b. PC with at least 16 GB RAM for handling big genomes (>1 Gb)
Juicer maps Hi-C paired-end reads to the assembled contigs and generates the Hi-C interaction matrix for downstream analysis.
mkdir hic; cd hic
git clone https://github.com/theaidenlab/juicer.git
ln -s juicer/SLURM/scripts/ scripts
cd scripts; wget https://hicfiles.tc4ga.com/public/juicer/juicer_tools.1.9.9_jcuda.0.8.jar
ln -s juicer_tools.1.9.9_jcuda.0.8.jar juicer_tools.jar; cd ../
mkdir references
mkdir restriction_sites
Make sure samtools and bwa are in your $PATH
export PATH=your_samtools/samtools:$PATH
export PATH=your_bwa/bwa:$PATH
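Before launching Juicer, it can help to confirm that the required tools really are reachable through $PATH. The snippet below is a minimal sketch; `check_tools` is a hypothetical helper written for this illustration (it is not part of Juicer), and the tool list is an assumption based on the software named in this protocol.

```shell
# Hypothetical helper: report every tool that cannot be found through $PATH.
# Prints "missing: <tool>" for each absent tool and returns non-zero if any is absent.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools that this protocol calls on the server (assumed list).
check_tools bwa samtools java awk || echo "Fix PATH before running Juicer"
```

If a tool is reported as missing, add the directory that contains its executable to PATH and run the check again.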
Tips: Juicer can be run on AWS, LSF, Univa Grid Engine (UGER), SLURM, or a single CPU. Users may need to change the command “ln -s juicer/SLURM/scripts/ scripts” in the b. Configure Juicer section to fit their system; for example, use “ln -s juicer/AWS/scripts/ scripts” for AWS. macOS users can replace wget with curl, for example "curl -L -O https://hicfiles.tc4ga.com/public/juicer/juicer_tools.1.9.9_jcuda.0.8.jar" (-O keeps the remote file name and -L follows redirects). The same substitution applies to every wget command in this protocol.
a. Copy your_contigs.fasta (or make a soft link to it) into the references directory, and index it with bwa index
ln -s your_path/your_contigs.fasta ./references
cd ./references; bwa index your_contigs.fasta; cd ..
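bwa index writes five files next to the FASTA file (with the suffixes .amb, .ann, .bwt, .pac and .sa), so an interrupted indexing job can be spotted before Juicer is started. The sketch below uses a hypothetical helper, `bwa_index_complete`, and empty toy files in a temporary directory instead of a real index.

```shell
# Hypothetical helper: succeed only if all five BWA index files exist for a FASTA file.
bwa_index_complete() {
    for ext in amb ann bwt pac sa; do
        if [ ! -e "$1.$ext" ]; then
            echo "missing: $1.$ext"
            return 1
        fi
    done
    echo "index complete: $1"
}

# Toy demonstration with empty placeholder files (not a real index).
workdir=$(mktemp -d)
for suffix in amb ann bwt pac; do : > "$workdir/toy.fasta.$suffix"; done
bwa_index_complete "$workdir/toy.fasta" || true   # .sa is absent, so it is reported
: > "$workdir/toy.fasta.sa"
bwa_index_complete "$workdir/toy.fasta"
```

On real data the call would be `bwa_index_complete references/your_contigs.fasta`, run from the hic directory.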
b. Prepare enzyme site file for your_contigs.fasta
cd restriction_sites
wget https://raw.githubusercontent.com/aidenlab/juicer/main/misc/generate_site_positions.py
Then use vi or vim to edit generate_site_positions.py and insert the following line at line 25:
'your_contigs': '../references/your_contigs.fasta',
After that:
python generate_site_positions.py your_enzyme your_contigs
awk 'BEGIN{OFS="\t"}{print $1, $NF}' your_contigs_your_enzyme.txt > your_contigs.chrom.sizes
cd ..
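The awk command above keeps the first and last field of every line in the restriction site file. This works because each line is expected to hold a sequence name, the cut positions, and finally the sequence length. The following toy example illustrates that transformation; the contig names and positions are made up and the file layout is an assumption inferred from the awk call, not output copied from a real run.

```shell
# Toy stand-in for your_contigs_your_enzyme.txt (made-up contigs and positions).
workdir=$(mktemp -d)
cat > "$workdir/toy_sites.txt" <<'EOF'
contig_1 120 480 950 1000
contig_2 75 300
EOF

# Same awk call as in the protocol: keep the name and the last field (the length).
awk 'BEGIN{OFS="\t"}{print $1, $NF}' "$workdir/toy_sites.txt" > "$workdir/toy.chrom.sizes"
cat "$workdir/toy.chrom.sizes"
```

The resulting file has one tab-separated "name, length" pair per contig, which is the layout Juicer expects for the -p argument.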
c. Filter and clean raw Hi-C sequencing data
wget https://github.com/usadellab/Trimmomatic/files/5854859/Trimmomatic-0.39.zip
unzip Trimmomatic-0.39.zip
java -jar ./Trimmomatic-0.39/trimmomatic-0.39.jar PE -threads your_threads -phred33 -trimlog trimmomatic.log your_hic_R1.fastq.gz your_hic_R2.fastq.gz your_hic_pair_R1.fastq.gz your_hic_unpair_R1.fastq.gz your_hic_pair_R2.fastq.gz your_hic_unpair_R2.fastq.gz ILLUMINACLIP:./Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Tips: your_contigs, your_enzyme, and your_threads are placeholders: substitute the name of your own contig file, the restriction enzyme used for your Hi-C library, and the number of threads you would like to use. your_hic_R1.fastq.gz and your_hic_R2.fastq.gz are the raw paired-end Hi-C reads.
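After trimming, the two paired output files should contain the same number of reads. A quick sanity check is sketched below; `count_reads` is a hypothetical helper, and the two tiny FASTQ files are made-up toy data written to a temporary directory.

```shell
# Hypothetical helper: count reads in a gzip-compressed FASTQ file (4 lines per read).
count_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Toy paired files with two reads each (made-up sequences).
workdir=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' | gzip > "$workdir/toy_pair_R1.fastq.gz"
printf '@r1/2\nCGTA\n+\nIIII\n@r2/2\nGGCA\n+\nIIII\n' | gzip > "$workdir/toy_pair_R2.fastq.gz"

n1=$(count_reads "$workdir/toy_pair_R1.fastq.gz")
n2=$(count_reads "$workdir/toy_pair_R2.fastq.gz")
if [ "$n1" -eq "$n2" ]; then
    echo "paired files agree: $n1 reads each"
else
    echo "read counts differ: $n1 vs $n2"
fi
```

On real data, pass your_hic_pair_R1.fastq.gz and your_hic_pair_R2.fastq.gz to `count_reads` instead of the toy files.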
mkdir your_contigs_hic; cd your_contigs_hic
mkdir fastq
ln -s "$PWD"/../your_hic_pair* ./fastq/   ## absolute targets; a relative ../ target would dangle inside fastq/
bash ../scripts/juicer.sh -D $PWD/.. -g your_contigs -s your_enzyme -p ../restriction_sites/your_contigs.chrom.sizes -y ../restriction_sites/your_contigs_your_enzyme.txt -z ../references/your_contigs.fasta -Q 2-00:00 -L 7-00:00 -q your_queue_name -l your_long_queue_name -t your_threads -A your_account --assembly
cd ..
Tips: Check juicer.sh and make sure the Partition, Account, QOS and Threads settings fit your cluster’s scheduler. To be safe, pass these parameters explicitly on the command line. Juicer submits its jobs to the cluster through the scheduler automatically. After all the jobs are done, the file merged_nodups.txt in your_contigs_hic/aligned is used as input for the 3D-DNA pipeline.
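Once merged_nodups.txt exists, a rough look at how many read pairs link two different contigs can indicate whether the library carries useful scaffolding signal. The sketch below assumes the Juicer long format, in which the second and sixth whitespace-separated columns hold the names of the two mapped sequences; the three records are made-up toy data and `summarize_contacts` is a hypothetical helper.

```shell
# Toy stand-in for aligned/merged_nodups.txt (made-up records, 16 columns each).
workdir=$(mktemp -d)
cat > "$workdir/toy_merged_nodups.txt" <<'EOF'
0 contig_1 100 0 16 contig_1 900 2 60 50M ACGT 60 50M TTGA read1/1 read1/2
0 contig_1 250 1 0 contig_2 40 0 60 50M ACGT 60 50M TTGA read2/1 read2/2
16 contig_2 75 0 0 contig_2 290 1 60 50M ACGT 60 50M TTGA read3/1 read3/2
EOF

# Hypothetical helper: count read pairs within one contig versus between two contigs.
# Assumes column 2 and column 6 are the sequence names of the two read ends.
summarize_contacts() {
    awk '$2 == $6 { intra++ }
         $2 != $6 { inter++ }
         END { printf "total=%d intra=%d inter=%d\n", NR, intra, inter }' "$1"
}

summarize_contacts "$workdir/toy_merged_nodups.txt"
```

On real data the call would be `summarize_contacts aligned/merged_nodups.txt`, run from your_contigs_hic.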
The 3D-DNA pipeline is designed to correct misassemblies and scaffold contigs based on the Hi-C interaction matrix. It generates the scaffolds fasta file, plus the .hic and .assembly files used for visualization in Juicebox.
wget https://github.com/aidenlab/3d-dna/archive/refs/tags/201008.tar.gz
tar -zxf 201008.tar.gz
chmod 554 ./3d-dna-201008/*.sh ##add execute permission
cd your_contigs_hic
../3d-dna-201008/run-asm-pipeline.sh ../references/your_contigs.fasta ./aligned/merged_nodups.txt
After the job is done, two of the output files, your_contigs.rawchrom.assembly and your_contigs.rawchrom.hic, will be used by Juicebox.
Tips: If the scaffolding results are not ideal, try a different number of misjoin-correction rounds (--rounds) and slightly increase the repeat coverage threshold of the misjoin editor (--editor-repeat-coverage). In this case study, we use --editor-repeat-coverage 3.
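The tuning described above can be sketched as a small wrapper that prints the full command before anything is run. `run_3ddna` and the DRY_RUN switch are hypothetical names introduced for this illustration; the value 3 for --editor-repeat-coverage comes from the case study, while the number of rounds shown is only an example value.

```shell
# Hypothetical wrapper: print (DRY_RUN=1, the default here) or run 3D-DNA with an
# explicit number of edit rounds and an explicit repeat coverage threshold.
run_3ddna() {
    rounds=$1
    coverage=$2
    cmd="../3d-dna-201008/run-asm-pipeline.sh --rounds $rounds --editor-repeat-coverage $coverage ../references/your_contigs.fasta ./aligned/merged_nodups.txt"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

# Example values: 2 rounds (arbitrary), repeat coverage 3 (as in the case study).
run_3ddna 2 3
```

Setting DRY_RUN=0 makes the wrapper execute the command instead of printing it; it is meant to be called from your_contigs_hic.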
a. Based on the system, the corresponding Juicebox 1.11.08 version can be downloaded at https://github.com/aidenlab/Juicebox/wiki/Download
b. Download your_contigs.rawchrom.assembly and your_contigs.rawchrom.hic to your PC
c. Run Juicebox, then load your_contigs.rawchrom.hic and your_contigs.rawchrom.assembly in turn (Figure 1)
Figure 1. Steps to load .hic and .assembly file to Juicebox. A. Load .hic file; B. Load .assembly file.
d. Correct scaffolding manually (Figure 2 and Figure 3)
- Shift+left-click to choose the region that needs to be edited.
- Right-click to choose to remove or add chr boundaries.
- Move the mouse to the upper-right of the selected region until a circle appears, and then left-click to rotate the selected region
Figure 2. Examples of how to edit a misjoin and a misorientation. A. Edit a misjoin by removing and adding chr boundaries. B. Edit a misorientation by rotating the selected contigs.
Figure 3. Manually correcting the scaffolding with Juicebox. A. The original scaffolding visualization. B. The manually corrected scaffolding. Ex. 1: misjoin, Ex. 2: misorientation.
Tips: Here is a demo video from the software developer that shows how to use Juicebox.
a. Upload the edited assembly file (here your_contigs.rawchrom.edit.assembly; use the name of the file you saved from Juicebox) to the cluster and put it in your_contigs_hic
b. Run the 3D-DNA pipeline from your_contigs_hic to obtain your edited scaffolds fasta file
../3d-dna-201008/run-asm-pipeline-post-review.sh -r your_contigs.rawchrom.edit.assembly ../references/your_contigs.fasta aligned/merged_nodups.txt
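After the post-review step, listing the scaffolds by length is a quick way to see whether the number of large scaffolds matches the expected chromosome number. The sketch below uses a hypothetical helper, `scaffold_lengths`, and a made-up toy FASTA file; no assumption is made here about the name of the FASTA file that 3D-DNA writes.

```shell
# Hypothetical helper: print "name<TAB>length" for every FASTA record, longest first.
scaffold_lengths() {
    awk '/^>/ { if (name != "") print name "\t" len
                name = substr($1, 2); len = 0; next }
         { len += length($0) }
         END { if (name != "") print name "\t" len }' "$1" | sort -k2,2nr
}

# Toy FASTA with three made-up scaffolds.
workdir=$(mktemp -d)
cat > "$workdir/toy_scaffolds.fasta" <<'EOF'
>scaffold_1
ACGTACGTAC
GGTTAA
>scaffold_2
ACGT
>scaffold_3
ACGTACGTACGTACGTACGT
EOF

scaffold_lengths "$workdir/toy_scaffolds.fasta"
```

For a real assembly, piping the output through `head` shows only the longest scaffolds.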
a. Install BUSCO, and download database
conda install -c bioconda busco
mkdir busco_evalue; cd busco_evalue
wget https://busco-data.ezlab.org/v4/data/lineages/embryophyta_odb10.2020-09-10.tar.gz
tar -zxf embryophyta_odb10.2020-09-10.tar.gz
b. Run BUSCO
busco -c your_threads -m genome -i ../your_contigs_hic/your_contigs_arrow_nextpolish_HiC.fasta -o your_contigs_hic_busco -l ./embryophyta_odb10
c. Check the BUSCO value and components (Figure 4)
Figure 4. BUSCO summary information.
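To compare several scaffolding attempts, the percentage of complete BUSCOs can be pulled out of the one-line summary that BUSCO reports. The sketch below assumes the usual "C:..%[S:..%,D:..%],F:..%,M:..%,n:.." notation; `busco_complete` is a hypothetical helper, and the summary file and its values are made-up toy data, not results from this study.

```shell
# Hypothetical helper: extract the percentage of complete BUSCOs from a summary file.
busco_complete() {
    grep -o 'C:[0-9.]*%' "$1" | head -n 1 | tr -d 'C:%'
}

# Toy summary file with made-up values in BUSCO's one-line notation.
workdir=$(mktemp -d)
cat > "$workdir/toy_short_summary.txt" <<'EOF'
# Toy BUSCO summary (made-up values)
	C:97.8%[S:95.1%,D:2.7%],F:0.9%,M:1.3%,n:1614
EOF

busco_complete "$workdir/toy_short_summary.txt"
```

On real data, point the helper at the short summary text file that BUSCO writes into its output directory.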
This repository is free and open-source software, licensed under GPLv3.