Tetraploid with high heterozygosity #708

Open
Liyong-Zhang opened this issue Sep 24, 2024 · 1 comment

@Liyong-Zhang

Hi there,

I am using hifiasm (version 0.19.7-r598) to assemble a plant genome (2n=28) with HiFi and Hi-C data.
First, assuming the plant was diploid, I ran: hifiasm -o OS010681.asm -t 64 --h1 hic_r1.fastq.gz --h2 hic_r2.fastq.gz OS010681.hifi.fq.gz

I then used the resulting OS010681.asm.hic.p_ctg.gfa to draw a mummerplot against the A. thaliana reference genome.
[Image: mummerplot_v6]

According to the mummerplot, this plant looks like a tetraploid. To check the heterozygosity rate, I ran GenomeScope2 on the HiFi reads (attachments: p4_transformed_linear_plot, p4_summary.txt); the estimated heterozygosity is quite high (~8%).

I then re-read the FAQ before re-running the assembly. One entry (https://hifiasm.readthedocs.io/en/latest/faq.html#which-types-of-assemblies-should-i-use) says “if Hi-C data is available, hic.hap.p_ctg.gfa produced in Hi-C mode is the best choice”, and another (https://hifiasm.readthedocs.io/en/latest/faq.html#are-polyploid-genomes-supported) says “The *r_utg.gfa and *p_utg.gfa are lossless so that they also work for polyploid genomes. However, currently the contig-generation modules of hifiasm are designed for diploid samples, which means both the partially phased assembly and the fully-phased assembly does not directly support polyploid genomes”.

I also referred to issue #571 and re-ran hifiasm with: hifiasm -o OS010681.asm.v2 -t 64 -s 0.25 --n-hap 4 --h1 hic_r1.fastq.gz --h2 hic_r2.fastq.gz OS010681.hifi.fq.gz
This produced OS010681.asm.v2.hic.p_ctg.gfa (276M) and four haplotype files:
OS010681.asm.v2.hic.hap1.p_ctg.gfa (296M), OS010681.asm.v2.hic.hap2.p_ctg.gfa (267M), OS010681.asm.v2.hic.hap3.p_ctg.gfa (276M), OS010681.asm.v2.hic.hap4.p_ctg.gfa (353M).
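
GFA file size is only a rough proxy for assembly size, so the outputs above may be easier to compare by total sequence length. The following is a minimal sketch, not part of hifiasm: it assumes the usual GFA layout in which segment lines start with S and carry the sequence in the third tab-separated column, and the function name gfa_stats is hypothetical.

```python
def gfa_stats(lines):
    """Return (segment count, total bases, N50) for an iterable of GFA lines."""
    lengths = []
    for line in lines:
        if not line.startswith("S\t"):
            continue  # only segment lines carry contig/unitig sequences
        fields = line.rstrip("\n").split("\t")
        seq = fields[2]
        if seq == "*":
            # Sequence omitted: fall back to the LN:i: tag if it is present.
            tags = [f for f in fields[3:] if f.startswith("LN:i:")]
            lengths.append(int(tags[0][5:]) if tags else 0)
        else:
            lengths.append(len(seq))

    total = sum(lengths)
    n50 = 0
    running = 0
    # N50: length of the segment at which the running sum reaches half the total.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return len(lengths), total, n50
```

For example, calling gfa_stats on an open file handle for each of the hap1 to hap4 GFA files gives the number of contigs, total bases, and N50 per haplotype, which can then be compared against the expected genome size.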

In #431, you mentioned that “If you have HiC reads, the latest release Hifiasm-0.19.3-r572 will give you 4 haplotypes. But the results might be not perfect right now”. I am now unsure which assembly files I should use for scaffolding with yahs.

Could you also help me with the following questions?
Q1: OS010681.asm.v2.hic.p_ctg.gfa (276M) is much smaller than OS010681.asm.hic.p_ctg.gfa (387M) from the previous run. What causes this difference?

Q2: The FAQ (https://hifiasm.readthedocs.io/en/latest/faq.html#are-polyploid-genomes-supported) says “The *r_utg.gfa and *p_utg.gfa are lossless so that they also work for polyploid genomes”. What is the difference between *p_utg.gfa and *p_ctg.gfa, and how could I use the information in p_utg.gfa for my polyploid assembly?

Q3: In #431, you mentioned “mannually set --hom-cov to the homozygous coverage”. How large is the impact of setting --hom-cov manually? If possible, could you also give a few more details on how to calculate the homozygous coverage?

Q4: In #537, you mentioned that “-l0 is designed for the homozygous sample, which will disable diploid phasing. Please do not use -l0 for the Hi-C phasing”. What is the default value of -l when hifiasm is run in Hi-C mode?

Sorry for the long list of questions, and thank you so much for your help!

@Mills33

Mills33 commented Oct 3, 2024

Hello, I am afraid I can't answer all of your questions, but I recently assembled a highly heterozygous tetraploid with hifiasm and used the unitig (utg) assembly for scaffolding. The reason is that unitigs are haplotype-specific, whereas you cannot guarantee that the contigs are.

Unitigs can be thought of as high-confidence contigs, in that they contain no conflicts. Contigs are produced by joining unitigs, and the assembler has to make decisions along the way. For example, when it reaches a bubble in the graph (i.e. a heterozygous site), it picks one of the alleles (four, in a tetraploid) and the others are treated as part of the alternate assembly. p_ctg is the primary contig assembly, so at each bubble it contains only one allele; the alternate assembly can also be output using a flag in hifiasm.

I used an older version of hifiasm and found that the phasing did not separate the four haplotypes well; I have not tested the current version.
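
Scaffolders such as yahs take FASTA input rather than GFA, so the unitigs have to be extracted from the graph before scaffolding. The following is a minimal sketch of that conversion, not the exact procedure used for the assembly described above: it assumes segment (S) lines with the name in the second column and the sequence in the third, and the function name gfa_to_fasta is hypothetical.

```python
def gfa_to_fasta(gfa_lines, width=80):
    """Yield FASTA lines (header, then wrapped sequence) for each GFA segment."""
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue  # skip links, read alignments and headers
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        if seq == "*":
            continue  # segment has no stored sequence, nothing to write
        yield f">{name}"
        # Wrap the sequence at a fixed width for readability.
        for start in range(0, len(seq), width):
            yield seq[start:start + width]
```

One way to use it is to open the unitig graph (for instance a file named like OS010681.asm.v2.hic.p_utg.gfa, which is an assumed name following the prefix in this issue) and write each yielded line, followed by a newline, to a .fa file.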

We currently have a preprint out showing how we dealt with a highly heterozygous tetraploid which you may find useful: https://www.biorxiv.org/cgi/content/short/2024.09.25.614935v1
