Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Could you explain in detail how to extend reads? #25

Open
zhangwenda0518 opened this issue Feb 24, 2024 · 3 comments
Open

Could you explain in detail how to extend reads? #25

zhangwenda0518 opened this issue Feb 24, 2024 · 3 comments

Comments

@zhangwenda0518
Copy link

Dear autor:
I am assembling a plant genome, and now it is assembled to the level without gap. It have 21 chromosomes, but so far only 39 have been counted, and there are 3 motifs (AAACCCT) without telomeres.
Seeing the tools you developed, I think it will save a lot of manual work. I have tried step by step,
1. First index the reference assembly
samtools faidx hifiasm.fasta
2. Reading alignments from SAM file
minimap2 -ax map-hifi -t 120 hifiasm.fasta ../hifi/all.hifi.fasta | samtools view -@ 40 -h -F 0x100 >in.sam
3. Report clipped alignments containing target motifs
teloclip --ref hifiasm.fasta.fai --motifs AAACCCT in.sam| samtools sort -@ 60 > out.motifs.bam
4. Extract clipped reads
teloclip --ref hifiasm.fasta.fai --motifs AAACCCT in.sam | teloclip-extract --refIdx hifiasm.fasta.fai --extractReads --extractDir SplitOverhangs

but I am not familiar with the use of miniasm. the result is empty !

minimap2 -t 60 -x ava-pb B01_L.fasta no-gap.id_B01.fasta > telo-read.paf 
miniasm -f no-gap.id_B01.fasta telo-read.paf > telo-min.gfa 
awk '/^S/{print ">"$2"\n"$3}' telo-min.gfa | seqkit seq > telo-min.fasta 

Could you give me some suggestions? Thank you very much.

@zhangwenda0518
Copy link
Author

zhangwenda0518 commented Feb 24, 2024

The run log of miniasm

[M::main] ===> Step 1: reading read mappings <===
[M::ma_hit_read::0.0131.07] read 32554 hits; stored 7950 hits and 3 sequences (16454807 bp)
[M::main] ===> Step 2: 1-pass (crude) read selection <===
[M::ma_hit_sub::0.014
1.07] 3 query sequences remain after sub
[M::ma_hit_cut::0.0141.07] 7816 hits remain after cut
[M::ma_hit_flt::0.014
1.07] 3386 hits remain after filtering; crude coverage after filtering: 638.37
[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::0.0141.06] 3 query sequences remain after sub
[M::ma_hit_cut::0.014
1.06] 3386 hits remain after cut
[M::ma_hit_contained::0.014*1.06] 1 sequences and 0 hits remain after containment removal
[M::main] ===> Step 4: graph cleaning <===
[M::ma_sg_gen] read 0 arcs

[M::main] ===> Step 4.1: transitive reduction <===
[M::asg_arc_del_trans] transitively reduced 0 arcs
[M::main] ===> Step 4.2: initial tip cutting and bubble popping <===
[M::asg_cut_tip] cut 1 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 0 asymmetric arcs
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.3: cutting short overlaps (3 rounds in total) <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 4.4: removing short internal sequences and bi-loops <===
[M::asg_cut_internal] cut 0 internal sequences
[M::asg_cut_biloop] cut 0 small bi-loops
[M::asg_cut_tip] cut 0 tips
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.5: aggressively cutting short overlaps <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 5: generating unitigs <===
[M::main] Version: 0.3-r179
[M::main] CMD: miniasm -f no-gap.id_B01.fasta read.paf
[M::main] Real time: 0.034 sec; CPU: 0.035 sec

@zhangwenda0518
Copy link
Author

Another question is, what does the small base in the extracted sequence represent?
image

@Adamtaranto
Copy link
Owner

I guess you solved your minimap2 issue if you have generated output from teloclip-extract.

Please try this example workflow.

Make sure you have a stable release version for teloclip installed. You will also need samtools and minimap2.

# Install from PyPi
pip install teloclip
# Check version is 0.0.4
teloclip --version
#teloclip 0.0.4

First, it is advisable to view all reads that that overhang the end of a contig (no filtering for telomeric motifs).
You can do this by visualising the output file primary_overhangs.bam from this command in an alignment viewer i.e. IGV or Geneious.

# Create index of reference fasta
samtools faidx ref.fa

# Map hifi reads with mm2; 
# exclude secondary alignments; 
# keep only alignments that overhang the end of a contig;
# sort alignments and save as bamfile.
minimap2 -ax map-hifi ref.fa hifi_reads.fq.gz | samtools view -h -F 0x100 | teloclip --ref ref.fa.fai | samtools sort > primary_overhangs.bam

Some contigs will have MANY more overhangs than others, these are likely to be mitochondrial or chloroplast genomes. Be careful to identify these contigs and exclude them from extension steps.

Next, you can filter the primary overhang alignments for reads that contain the motif 'AAACCCT' in the overhang section by running the previous output through teloclip again with the --motifs option on.

We can then pass these filtered alignments to teloclip-extract which will write the reads to separate fasta files for each reference contig end.

samtools view -h primary_overhangs.bam | teloclip --ref ref.fa.fai --motifs AAACCCT | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs

You will have output file that look like this for each contig where at least one overhang with the motif was found:

# Overhang reads for the Left and Right ends of chrom 14
SplitOverhangs/14_L.fasta
SplitOverhangs/14_R.fasta

The soft-clipped or "overhang" segment of the read will be in lowercase letters. This will be at the start for reads that overhang the left end of the contig, or at the end for reads overhanging the right end of the contig.

To manually extend the contigs, you should first check the BAM file to see that:

  • the alignments are well anchored on the contig,
  • the overhangs generally agree with each other,
  • the overhangs actually contain runs of your query motif (if not try running teloclip with multiple copies of your repeat i.e. --motifs AAACCCTAAACCCTAAACCCT)
  • and that the contig end does not have an unusually high number of alignments (this may mean you have a circular contig or a contig ending in a repeat element).

If the alignments pass this check you can usually select the longest overhang region (lowercase in the teloclip-extract output) and paste these bases onto the end of your contig.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants