Streaming workflow for evolutionary and comparative genomics
Contributors
Qingxiang Guo
About
The following contents contains source code for the analyses and plots in BMC Biology paper, 2022, “A myxozoan genome reveals mosaic evolution in a parasitic cnidarian”
Abstract
Parasite evolution has been conceptualized as a process of genetic loss and simplification. Contrary to this model, there is evidence of expansion and conservation of gene families related to essential functions of parasitism in some parasite genomes, reminiscent of widespread mosaic evolution where subregions of a genome have different rates of evolutionary change. We found evidence of mosaic genome evolution in the cnidarian Myxobolus honghuensis, a myxozoan parasite of fish, with extremely simple morphology.
Notes
Written by Qingxiang Guo, qingxiang.guo@outlook.com, distributed without any guarantees or restrictions.
Codes
1. Use Blobtools to remove the contamination in myxozoan genome
1.1 Filter the genome to 200 bp
filter_fasta_by_length.pl genome.fasta 200 200000 genome_200.fasta
1.2 Get the coverage info
# Build a index for the genome
nohup bowtie2-build M.honghuensis_200_scaf.fasta index --threads 8 &
# Mapping
nohup bowtie2 -p 24 -x index -1 M.honghuensis_1.fq -2 M.honghuensis_2.fq -k 1 --very-fast-local -S out.sam &
# Convert the style
samtools view -bS out.sam > out.bam
1.3 Get the megablast results
nohup blastn -task megablast -query M.honghuensis_200_scaf.fasta -db nt -culling_limit 5 -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle staxids sscinames sskingdoms' -num_threads 48 -evalue 1e-25 -out assembly_megablast_25.out &
# Then add species information to the last line using the following method
ln -s /home/train/public_database/accession2taxid/acc2tax_nucl_all.txt ./
nohup perl -lne '
BEGIN{open UT, "<acc2tax_nucl_all.txt" or die $!; while () { $ut{$2}=$3 if /^(\S+)\t(\S+)\t(\S+)/ } }
{print "$_\t$ut{$1}" if /^\S+\t(\S+)/ and exists $ut{$1}}' \
< assembly_megablast_25.out \
> assembly_megablast_25.out_taxid &
# Change the output style
awk -v OFS="\t" -F"\t" '{print $1,$17,$12}' assembly_megablast_25.out_taxid > mega.out
1.4 Get the diamond-blastx results
# Make a index
diamond makedb --in uniref90.aa -d uniref90
nohup diamond blastx -q ../../2_assembly/M.honghuensis_200_scaf.fasta --sensitive -k 20 -c 1 --evalue 1e-10 --threads 48 --db /home/train/public_database/uniref90/uniref90.dmnd --out assembly_diamond_10.out &
# Add species information to the last line
ln -s /home/train/public_database/uniref90/uniref90.taxlist ./
perl -lne '
BEGIN{open UT, "<uniref90.taxlist" or die $!; while () { $ut{$1}=$2 if /^(\S+)\t(\S+)$/ } }
{print "$_\t$ut{$1}" if /^\S+\t(\S+)/ and exists $ut{$1}}' \
< assembly_diamond_10.out \
> assembly_diamond_10.out_taxid
# Change the ouput style
awk -v OFS="\t" -F"\t" '{print $1,$13,$12}' assembly_diamond_10.out_taxid > diamond.out
1.5 A database file was then generated based on the genome sequence, blast results, maps results
# File preparation
cd /opt/biosoft/blobtools/M.honghuensis.plot_1
ln -s /home/gqx/genome/M.honghuensis/3_contam_remove/diamond/diamond.out ./
ln -s /home/gqx/genome/M.honghuensis/3_contam_remove/megablast/mega.out ./
ln -s ~/genome/M.honghuensis/2_1_assembly/M.honghuensis_200_scaf.fasta ./
ln -s /home/gqx/genome/M.honghuensis/3_contam_remove/mapping_results/out.bam ./
# Merging and ordering blast results
cat mega.out diamond.out > blast.out
sort_blast_by_query_name.pl blast.out
mv sorted_output blast.out
python2.7 ../blobtools create -i M.honghuensis_200_scaf.fasta -b out.bam -t blast.out -o M.honghuensis_1_blob --names ../names.dmp --nodes ../nodes.dmp
1.6 Statistics performed on the database
python2.7 ../blobtools view -i M.honghuensis_1_blob.blobDB.json -o ./
grep '^##' M.honghuensis_1_blob.blobDB.table.txt ;
grep -v '^##' M.honghuensis_1_blob.blobDB.table.txt | \
column -t -s $'\t'
1.7 Visualization
python2.7 ../blobtools blobplot -i M.wulii_1_blob.blobDB.json -o ./ --format pdf --colours colours.txt
# colour.txt contains the file with the assigned color in the following format:
no-hit,#c6c6c6
other,#ffffff
Nematoda,#dd0000
Arthropoda,#3c78d8
Cnidaria,# 6aa84f
Chordata,#fdfd99
Firmicutes,#fca55d
Proteobacteria,#47a0b3
1.8 Contamination was removed according to blobtools visualization
# Extract the abnormal sequences (score > 200) from the M.honghuensis_1_blob.blobDB.table.txt file. Blast it with nt database and remove those aligned to bacteria and host database
# Adjust the output format
# format.sh
#!/bin/bash
grep '^##' M.honghuensis_1_blob.blobDB.table.txt;\
grep -v '^##' M.honghuensis_1_blob.blobDB.table.txt | \
column -t -s $'\t'
./format.sh > result
1.9 Processing the output file and get the blast results
blob_result_seq_extract.pl M.honghuensis_1_blob.blobDB.table.txt
# Extract the potential contamination reads
extract_seq_from_fasta.pl M.honghuensis_200_scaf.fasta seq_for_blast
mv extracted.fasta contam_candidate.fa
1.10 NT alignment of suspected contaminant sequences
mkdir -p /home/gqx/genome/M.wulii/3_contam_remove/contam_blast
cd /home/gqx/genome/M.wulii/3_contam_remove/contam_blast
cp /home/gqx/genome/M.wulii/3_contam_remove/M.wulii.plot_1/contam_candidate.fa ./
nohup blastn -query contam_candidate.fa -db /home/train/public_database/nt_160828/nt -evalue 1e-5 -max_target_seqs 20 -num_threads 24 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle frames sscinames sskingdoms" -out nt_result &
1.11 Process alignments and add species information
# Extract the best hits
export LANG=C; export LC_ALL=C; sort -k1,1 -k12,12gr -k11,11g -k3,3gr nt_result | sort -u -k1,1 --merge > bestHits
cat bestHits | cut -f2 > acc
perl -p -i -e 's/.(\d)//g' acc
acc2tax -i acc -o result -d /home/train/public_database/accession2taxid/
# Extract the hits with key words (teleostomi, bacteria) and get the accession number
./Teleostomi_Bacteria_extract.pl result
# Get the seq ID from the accession number
cat bestHits | cut -f1,2 > header
perl -p -i -e 's/.(\d)//g' header
./get_contam_from_accesion.pl contam_accession header
# Also remove those reads with coverage lower than 20
./another_blob_result_seq_extract.pl M.honghuensis_1_blob.blobDB.table.txt
# Get the seq_remove_by_bam0
cat ../M.honghuensis.plot_1/seq_remove_by_bam0 final_contam_header > true_bad_list
remove_duplicate.pl true_bad_list
mv duplicate_remove true_bad_list
# true_bad_list contains the final bad list
1.12 Remove the contamination by ID
remove_contaminant_by_ID.pl
~/genome/M.honghuensis/2_1_assembly/M.honghuensis_200_scaf.fasta ../3_contam_remove/contam_blast/ true_bad_list
mv survive.fasta M.wulii_200_clean.fasta
# The contam removal process is finished
2. Fourfold synonymous third-codon transversion (4dTV) analyses
# Please check the 4DTv_calculation.pl in attached Scripts
3. Plot the genome KEGG annotation results
# Extract the KEGG path way information from http://www.genome.jp/kegg/tool/map\_pathway.html
kegg_plot_all.pl kegg.path
# This script will call three other scripts: kegg_plot_step_1.pl, kegg_plot_step_2.pl, kegg_plot_step_3.pl
# It will output words like:
# Seperating original files according to Level1: kegg_plot_step_1.pl....
# Further processing output from former step: kegg_plot_step_2.pl....
# Calculating: kegg_plot_step_3.pl...
# Sorting results...
# Drawing...
# Done!
# There will be three outputs: final_kegg_for_ggplot (TSV output), plot3.pdf (final pdf outpuf), test.csv (CSV output)
4. Plot the radar (spider) chart
install.packages("fmsb")
library(fmsb)
data <- read.csv("transposon.csv",header=T)
data <- data.frame(A=c(data[1]),B=c(data[2]))
radarchart(data,
cglty = 1, # Grid line type
cglcol = "gray", # Grid line color
cglwd = 1, # Line width of the grid
pcol = 4, # Color of the line
plwd = 2, # Width of the line
plty = 1) # Line type of the line
5. Genome completeness analysis using Benchmarking Universal Single-Copy Orthologs (BUSCO)
5.1 Dependencies of BUSCO
# Make sure you have installed HMMER (HMMER > 3.1b2) , BLAST+, Augustus (>3.2.1 or 3.2.2) before using BUSCO
5.2 run BUSCO
nohup python2.7 BUSCO.py -i M.honghuensis_200_scaf.fasta -o M.honghuensis -l /opt/biosoft/busco/metazoa_odb9 -m geno -c 24 &
nohup /opt/biosoft/python_3/bin/python3.6 /opt/biosoft/busco/BUSCO.py -i H_1_all_6_frame -o H_1_all_6_frame -l /opt/biosoft/busco/metazoa_odb9/ -m proteins -c 64 &
python2.7 BUSCO.py -i SEQUENCE_FILE -o OUTPUT_NAME -l LINEAGE -m geno
python2.7 BUSCO.py -i SEQUENCE_FILE -o OUTPUT_NAME -l LINEAGE -m prot
python2.7 BUSCO.py -i SEQUENCE_FILE -o OUTPUT_NAME -l LINEAGE -m tran
# Main result file: short_summary_XXXX.txt
6. Genome completeness analysis using Core Eukaryotic Genes Mapping Approach (CEGMA)
6.1 Install dependencies for CEGMA
# Install Genewise 2.4.1
# Install geneid v1.4
wget ftp://genome.crg.es/pub/software/geneid/geneid\_v1.4.4.Jan\_13\_2011.tar.gz
tar zxf geneid_v1.4.4.Jan_13_2011.tar.gz
rm geneid_v1.4.4.Jan_13_2011.tar.gz
cd geneid/
make
cd bin/
echo "PATH=$PATH:/opt/biosoft/geneid/bin" >> ~/.bashrc
source ~/.bashrc
#Install CEGMA
wget http://korflab.ucdavis.edu/datasets/cegma/CEGMA\_v2.5.tar.gz
tar zxf CEGMA_v2.5.tar.gz
cd CEGMA_v2.5/
make
# If it report the error, “Can't locate Cegma.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at ./cegma line 34”
# To solve this error:
sudo cp Cegma.pm /usr/local/lib64/perl5/
sudo cp FAlite.pm /usr/local/lib64/perl5/
sudo cp geneid.pm /usr/local/lib64/perl5/
sudo cp HMMstar.pm /usr/local/lib64/perl5/
echo "PATH=$PATH:/opt/biosoft/CEGMA_v2.5/bin" >> ~/.bashrc
export CEGMA="/opt/biosoft/CEGMA_v2.5"
export CEGMATMP="/opt/biosoft/CEGMA_v2.5"
export PERL5LIB="$PERL5LIB:$CEGMA/lib"
6.2 Run CEGMA for genomes and transcriptomes
nohup cegma -g WL_all_filter_Unigene.fasta -T 8 &
# CEGMA actually has two databases, 458 eukaryotic conserved sequences, and 248 eukaryotic ultraconserved sequences, the latter is a subset of the former.
# All output results, except “output.completeness_ report” are based on 248 sequences and others are based on 458 sequences.
6.3 Run CEGMA for proteomes
# CEGMA was not developed for proteomes, but there is a tricky way to use it for evaluating the completeness of proteomes.
cd /opt/biosoft/CEGMA_v2.5/data/hmm_profiles
cat ./*.hmm > KOG.hmm
nohup hmmpfam –cpu 32 KOG.hmm ../../MH_MCPID.pep > cegma.output &
parse_hmmpfam_output.pl cegma.output ../../H_MCPID.pep 1e-05 458 /opt/biosoft/CEGMA_v2.5/ > result
7. Use ggplot2 to make the top hit species distribution plot of genome sequences
library(ggplot2)
library(ggthemes)
data <- read.csv("Species_distribution.csv",header=T)
data <- data[seq(30),] # only use the former 30 lines
data <- data.frame(A=c(data[1]),B=c(data[2]),C=c(data[3])) # Make it into a data.frame
data$Species <- factor(data$Species, levels = data$Species[order(data$Count)]) # Reorder the factors
p <- ggplot(data,aes(y=Count, x=Species))
p+geom_bar(stat="identity", width=0.8, fill= "dodgerblue3",colour="black", cex=0.4)+coord_flip()+ theme_set(theme_bw())+labs(title='The top hit species distribution')+theme(line=element_line(size=0.5))+ ylab("BLAST Hit")+geom_text(data=data, aes(x=Species, y=Count, label=paste0(round((Count/829)*100, digits=2), "%")), hjust=-0.2 ,size=3)+ scale_y_continuous(limits=c(0,250))
ggsave("plot2.pdf", width=8, height=8)
8. Use ggplot2 to visualize the Hisat mapping results of genome analysis (polar plot)
data<-read.csv("map.csv",header=T)
data <- data.frame(A=c(data[1]),B=c(data[2])) # convert it into dataframe
p <- ggplot(data,aes(x=Type,y=Number,fill= Type))
p+ geom_bar(stat="identity")
data$Type <- factor(data$Type, levels = c("Total reads","Total mapped reads","Unique mapped reads","Reads mapped in pair","Spliced mapped reads","Multiple mapped reads")) # reorder the factors
p <- ggplot(data,aes(x=reorder(Type,Number), y=Number, fill= Type)) + theme_set(theme_bw()) # use the white theme
p+geom_bar(stat="identity",alpha=0.9,width=0.5,colour='black',cex=0.4)+coord_polar(theta="y")+labs(title = "Mapping summary")+ theme(axis.text.x = element_blank(),axis.text.y = element_blank(),axis.title.x=element_blank(), axis.title.y=element_blank(), legend.title = element_blank(),axis.ticks=element_blank()) +guides(fill =guide_legend(keywidth = 1.1, keyheight = 1.1))
9. Use ggplot2 to make five-way Venn diagram
library(ggplot2)
library(VennDiagram)
# Read sample data
A <- sample(1:1000, 400, replace = FALSE);
B <- sample(1:1000, 600, replace = FALSE);
C <- sample(1:1000, 350, replace = FALSE);
D <- sample(1:1000, 550, replace = FALSE);
E <- sample(1:1000, 375, replace = FALSE);
# Drawing
venn.plot <- venn.diagram(x = list(A = A, B = B, C = C, D = D, E = E), filename = "Venn_5set_pretty.png", imagetype = "png", col = "black", fill = c("dodgerblue", "goldenrod1", "darkorange1", "seagreen3", "orchid3"), alpha = 0.50, cex = c(1.5, 1.5, 1.5, 1.5, 1.5, 1, 0.8, 1, 0.8, 1, 0.8, 1, 0.8, 1, 0.8, 1, 0.55, 1, 0.55, 1, 0.55, 1, 0.55, 1, 0.55, 1, 1, 1, 1, 1, 1.5), cat.col = c("dodgerblue", "goldenrod1", "darkorange1", "seagreen3", "orchid3"), cat.cex = 1.5, cat.fontface = "bold", margin = 0.05);
10. Get the statistics of assembled genome
# Please check the genome_statistic.pl in attached Scripts
Please cite this paper as
Guo, Q., Atkinson, S. D., Xiao, B., Zhai, Y., Bartholomew, J. L., & Gu, Z. (2022). A myxozoan genome reveals mosaic evolution in a parasitic cnidarian. BMC biology, 20(1), 1-19.
License
All source code, i.e. scripts/*.pl, scripts/*.sh or scripts/*.py are under the MIT license.