A method for measuring allele-specific telomere length (ATL) and characterizing telomere variant repeat (TVR) sequences from long reads.
If this software has been useful for your work, please cite us at:
Stephens, Z., & Kocher, J. P. (2024). Characterization of telomere variant repeats using long reads enables allele-specific telomere length estimation. BMC Bioinformatics, 25(1), 194.
https://link.springer.com/article/10.1186/s12859-024-05807-5
Telogator2 dependencies can be easily installed via conda:
# create conda environment
conda env create -f conda_env_telogator2.yaml
# activate environment
conda activate telogator2
python telogator2.py -i input.fq \
-o results/ \
--minimap2 /path/to/minimap2
-i accepts fa, fa.gz, fq, fq.gz, or bam (multiple can be provided, e.g. -i reads1.fa reads2.fa). For Revio reads sequenced with SMRTLink13 and onward, we advise including both the "hifi" BAM and "fail" BAM as input to Telogator2.
An aligner executable must be specified, via either --minimap2, --winnowmap, or --pbmm2.
Sequencing platforms differ in their error types, so we recommend running Telogator2 with options matched to the platform that was used:
PacBio Revio HiFi (30x) - -r hifi -n 4
PacBio Sequel II (10x) - -r hifi -n 3
Nanopore R10 (30x) - -r ont -n 4
For Nanopore reads generated using telomere enrichment methods, such as those described by Karimian et al., we recommend using -r ont -n 5 -tt 0.100 --collapse-hom 1000.
Telogator2 may be unable to analyze older Nanopore data, as reads basecalled with Guppy have prohibitively high sequencing error rates in telomere regions.
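When several samples from different platforms need to be processed, the recommended options above can be kept in one place and turned into a command line programmatically. The following is a minimal Python sketch, not part of Telogator2 itself: the preset names (revio_hifi, sequel2_hifi, ont_r10, ont_enriched) and the build_command helper are hypothetical labels introduced for illustration, while the flags themselves are the ones listed above.

```python
import shlex

# Presets copied from the recommendations above. The dictionary keys are
# hypothetical labels chosen for this sketch, not Telogator2 options.
PRESETS = {
    "revio_hifi": ["-r", "hifi", "-n", "4"],
    "sequel2_hifi": ["-r", "hifi", "-n", "3"],
    "ont_r10": ["-r", "ont", "-n", "4"],
    "ont_enriched": ["-r", "ont", "-n", "5", "-tt", "0.100",
                     "--collapse-hom", "1000"],
}

# Exactly one of these aligner flags must accompany an executable path.
ALIGNER_FLAGS = ("--minimap2", "--winnowmap", "--pbmm2")


def build_command(inputs, out_dir, preset, aligner_flag, aligner_path):
    """Return the telogator2.py argument list for one run."""
    if not inputs:
        raise ValueError("at least one input file is required")
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset}")
    if aligner_flag not in ALIGNER_FLAGS:
        raise ValueError(f"unknown aligner flag: {aligner_flag}")
    # -i takes one or more files, so all inputs follow a single -i.
    cmd = ["python", "telogator2.py", "-i", *inputs, "-o", out_dir]
    cmd += PRESETS[preset]
    cmd += [aligner_flag, aligner_path]
    return cmd


if __name__ == "__main__":
    cmd = build_command(["reads1.fa", "reads2.fa"], "results/",
                        "ont_r10", "--minimap2", "/path/to/minimap2")
    # Print a shell-quoted command; pass `cmd` to subprocess.run to execute.
    print(shlex.join(cmd))
```

Returning an argument list rather than a single string keeps file names with spaces intact if the list is later handed to subprocess.run.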
Telomere reads for HG002 can be found in the test_data/ directory.
HiFi reads (~70x): hg002-telreads_pacbio.fa.gz
ONT reads (~25x): hg002-telreads_ont.fa.gz
These are full-sized datasets and may take a while to run. A smaller input dataset (e.g. for just checking that Telogator2 runs successfully) is also provided: test_data/test.fa.gz.
The primary output files are:
tlens_by_allele.tsv - allele-specific telomere lengths
all_final_alleles.png - plots of all alleles (TVR + telomere regions)
violin_atl.png - violin plot of ATLs at each chromosome arm
The main results are in tlens_by_allele.tsv, which has the following columns:
chr - anchor chromosome arm
  - subtelomeres that could not be aligned are labeled chrU for 'unmapped'
position - anchor coordinate
ref_samp - the specific T2T reference contig to which the subtelomere was aligned
allele_id - ID number for this specific allele
  - ids ending in 'i' indicate subtelomeres that were aligned to known interstitial telomere regions. These alleles should likely be excluded from subsequent analyses.
TL_p75 - ATL (reports 75th percentile by default)
read_TLs - ATL of each supporting read in the cluster
read_lengths - length of each read in the cluster
read_mapq - mapping quality of each read in the cluster
tvr_len - length of the cluster's TVR region
tvr_consensus - consensus TVR region sequence
supporting_reads - readnames of each read in the cluster
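Since alleles anchored to interstitial telomere regions should likely be excluded downstream, a small filtering step is often the first thing done with tlens_by_allele.tsv. The sketch below shows one way to do this in Python. It assumes the file is tab-delimited with a single header row carrying the column names listed above (a leading '#' on the header is tolerated), and that allele_id is stored as text ending in the letter i for interstitial alleles. The helper names load_alleles and keep_for_analysis are hypothetical, and the demo rows are made up for illustration.

```python
import csv
import io


def load_alleles(handle):
    """Read tab-delimited rows into a list of dicts keyed by column name."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Assumption: the first row is a header; strip a leading '#' if present.
    header[0] = header[0].lstrip("#")
    return [dict(zip(header, row)) for row in reader if row]


def keep_for_analysis(row, drop_unmapped=False):
    """Return False for interstitial alleles and, optionally, chrU alleles."""
    if row["allele_id"].endswith("i"):
        return False
    if drop_unmapped and row["chr"] == "chrU":
        return False
    return True


if __name__ == "__main__":
    # Made-up rows for illustration only; not real Telogator2 output.
    demo = io.StringIO(
        "chr\tposition\tref_samp\tallele_id\tTL_p75\n"
        "chr1p\t1000\tsampleA\t0\t5200\n"
        "chr2q\t2000\tsampleB\t1i\t300\n"
        "chrU\t0\tsampleC\t2\t4100\n"
    )
    rows = load_alleles(demo)
    kept = [r for r in rows if keep_for_analysis(r)]
    print([r["allele_id"] for r in kept])
```

For a real run, replace the in-memory demo with open("results/tlens_by_allele.tsv", newline=""). Whether to also drop chrU alleles depends on the analysis, so it is left as an opt-in flag here.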
The reference sequence used for telomere anchoring currently contains the first and last 500kb of sequences from the following T2T assemblies:
T2T-chm13 - https://github.com/marbl/CHM13
T2T-yao - https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA017932
T2T-cn1 - https://github.com/T2T-CN1/CN1
T2T-hg002 - https://github.com/marbl/hg002
T2T-ksa001 - https://github.com/bio-ontology-research-group/KSA001
T2T-i002c - https://github.com/LHG-GG/I002C
More will be added as they become available.