-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Confusing about input and output file #20
Comments
Hi Andrew, There is some minimal test data in In the most basic use case teloclip only requires two inputs:
In some of the examples I show the sam output being piped to This tool has not been formally published yet. If you find it useful you can cite the git repo directly and note the release version used in your analysis. User feedback is helpful for improving the tool, so please do let me know if you have and difficulties / successes using it. |
Dear Adam, Thank you for your warm help! And I will try it. By the way, I have tried to break down the command into separate steps, please have a look and check it: Streaming SAM records from alignerThis part based on the command #Mapping pacbio reads to ref genome by using minimap2 #Filtering the SAM file using samtools view to exclude specific flags #Identifying unassembled telomeric repeats in the alignments #Identifying unassembled telomeric repeats in the alignments, and then filter it and output to a bam file Report clipped alignments containing target motifsThis part based on the command #extract the header information from the BAM file #identify unassembled telomeric repeats(TTAGGG), and save results in a new SAM file #sort the alignments by reference position, and saves the sorted data as a BAM file
It is currently midnight in Melbourne. Please take a rest first. Have a good night, and thank you very much. :) |
That's pretty much correct.
Note: Don't use flag 0x2308 to filter, I've updated the docs to recommend 0x100 which just excludes secondary alignments. Filtering is optional.
Yes, in this example the
The bam file here is the same as the raw sam alignments you would get from
The
This is the same as before but the Note:
Yep.
This is an alternative to the previous example, the only difference is that it requires a minimum of 5 repeats of the Note: I've corrected the readme example to describe this use as well as the --nopoly option which will ignore homopolymer tracks.
No, the bam or sam that gets passed to |
Dear Adam, Thanks a lot! And I have successfully get the overhangs. And I would like to ask if you have some recommendation on extending contig? Maybe I need to map the overhangs to its corresponding chromosome, and generate the bam file for IGV? If you are interesting on it, I can provide the files |
Oops, I have checked the overhangs, it seems that the extracted overhang reads' structure is as below: sequence of chromosome end - telomere motif repeat - unknown sequence I have upload all the overhang reads in google drive, how can I share it with you if you need them? |
For extending the contigs/scaffolds I would do something along these lines:
minimap2 -ax map-pb ref.fa pacbio.fq.gz | teloclip --ref ref.fa.fai | samtools sort > all_overhangs.bam
# Filter previous alignments to keep only primary alignments that have at least 1 run of TTAGGGx3 in the clipped region.
samtools view -h -F 0x100 all_overhangs.bam | teloclip --ref ref.fa.fai --motifs TTAGGGTTAGGGTTAGGG | samtools sort > TTAGGGx3_overhang_primary_alignments.bam
samtools view -h TTAGGGx3_overhang_primary_alignments.bam | teloclip-extract --refIdx ref.fa.fai --extractReads --extractDir SplitOverhangs
Your strategy might be slightly different depending on the quality of your starting assembly (how was it created? Have you already done scaffolding with Hi-C or long-read data?) as fragmented assemblies will have many clipped-overhangs that are NOT telomeres. You should also consider the expected telomere structure/motifs for your species. Some species do not have standard telomeric motifs. |
Can you share some of the extracted reads from You can also paste screenshots from IGV here. |
Dear Adam, No problem, please wait a moment, I have a meeting today. By the way I have made a mistake-overhanging region of each read is in lowercase |
Dear Adam, Please check script and overhang sequence: https://drive.google.com/drive/folders/14PMFaGdDTcj6mwMILGV7JlLVT_x8c6ap?usp=drive_link |
Another question is about the contig end. For example, if we conduct juicebox pipeline, once we modify the end of chromosome (or "superscaffold") wrongly, there will be a situation as below:
Do you have any recoomendation on it? Actually I think teloclip can be quite flexible - we can use it for contig extend or even gap closing. I believe teloclip will be a widely used tools! |
Scaffolds can be validated with Hi-C data, and/or by aligning your long read data back to the scaffolded assembly (after gap-filling). In the later case, you would see a bunch of reads soft-clipped at the mis-assembly site. In your example above, teloclip will not report the read as it does not overhang the end of a scaffold. I expect that teloclip will work best on assemblies that have already been (correctly) scaffolded with Hi-C. btw you need to approve access to your gdrive folder. |
Thank you, and I apologize for forgetting to grant you access. I have updated the overhangs sequence. By the way, I noticed that you mentioned using miniasm to generate a contig for manually extending (#10 (comment)). Do you have any hints on how to do that? |
In this case I think your overhangs look pretty consistent with each other. I would try this:
It looks like there is some variation in your telomeric motifs. I see CCCTGAA as well as CCCTAAA. I guess this is a plant? It might be worthwhile running teloclip without motifs as well to see if you pick up extra reads. Once you have your final assembly you could run one of the tools for de novo telomeric repeat prediction. |
Dear Adam, You are right! I am working on a plant genome(~450Mb, het=1.29, 2n=30), and I would follow your advice. But I would like to say some longest sequence seem to have limited number of telomeric motifs(such as Hic_scaffold_10_R). Maybe we can check & polish it with IGV and racon |
Most of your scaffolds with at least 1 overhang detected by teloclip have a read depth of between 1-17X at the end of the scaffold. 10R and 11L have a depth of >450X. I think the ends of these scaffolds are probably repeats and you are getting non-specific alignment. You should inspect the coverage (of all reads mapped to the whole assembly) in IGV to see if these locations just have lots of alignments. At least for 10R I don't think it is a chromosome end. 14R has >90X depth and should probably be checked too. You could also try aligning your reads with Winnowmap which might handle repeats better than Minimap2. I haven't tested it with teloclip yet so let me know how it changes your results. |
The high coverage scaffold ends might also be mitochondrial and chloroplast genomes come to think of it. You should identify and exclude those scaffolds. |
Dear Adam, Apologies for the late reply. I've been occupied with a complex work issue and my wedding ceremony in recent days. I will continue to try it and provide you with feedback. |
Dear Adam,
It is amazing to find such a great tool for telomere anchoring! But I would like to say some expression you mentioned in the tutorial is blurred, especially the input and output file (like in.sam, out.sam and out.bam...). If you can clarify them, I would appreciate it. By the way, do you have any recommendations about citing your work, just github link or you have published it?
Thanks
The text was updated successfully, but these errors were encountered: