yanocomp
does detection of RNA modifications from nanopore direct RNA sequencing (DRS) data which has been "eventaligned" using nanopolish
. It uses a comparative approach with General mixture models similar to those used by nanocompore
and xpore
. The main selling points of yanocomp
are:
- It parses nanopolish eventalign output on the fly into a (relatively) compact HDF5 file allowing random access.
- A GTF file can be provided to convert transcriptomic coordinates from eventalign back to genomic coordinates.
- It uses the fast
pomegranate
library for model fitting. - It fits models using multiple adjacent kmers to better separate modified and unmodified distributions.
- A uniform distribution is used to model outliers caused by low quality signal or poor alignment. This improves the fit and reduces false positives.
- The trained models can be used to make single molecule modification predictions for each read at each significant position.
For the comparative method, you will need nanopore DRS data from a control sample with normal levels of modifications and a treatment sample with altered levels of modifications. NB: yanocomp
is still in quite a beta stage so there are likely to be bugs!
yanocomp
has been tested with python 3.8.5, and requires numpy
, pandas
, h5py
, pomegranate
, scipy
, statsmodels
, and click
. The easiest way to install it is using the conda environment yaml provided:
git clone https://github.com/bartongroup/yanocomp.git
cd yanocomp
conda env create -f yanocomp.yml
conda activate yanocomp
Alternatively yanocomp
and the required packages can be installed using pip:
pip install git+git://github.com/bartongroup/yanocomp.git
Before running yanocomp
, you should align your nanopore DRS data to a transcriptome reference (with no spliced alignment, no secondary alignments), for example using minimap2
. You should then align the signal level data using nanopolish
with the command:
nanopolish eventalign \
--scale-events \
--signal-index \
--print-read-names \
-r <fastq-fn> \
-b <bam-fn> \
-g <reference-fasta-fn>
yanocomp
itself has two main commands:
yanocomp prep
will parse a text or gzipped file containing the tabular results from nanopolish eventalign
and summarise them into a more compact HDF5 file. Output from eventalign can also be piped to yanocomp
prep on the fly. Basecalled reads should be aligned to a transcriptome reference before event alignment, but yanocomp prep
can convert back to genomic coordinates if a GTF file is provided.
$ yanocomp prep --help
Usage: yanocomp prep [OPTIONS]
Parse nanopolish eventalign tabular data into a more managable HDF5
format...
Options:
-e, --eventalign-fn TEXT File containing output of nanopolish eventalign.
Can be gzipped. Use - to read from stdin
[default: -]
-h, --hdf5-fn TEXT Output HDF5 file [required]
-g, --gtf-fn TEXT Optional GTF file which can be used to convert
transcriptomic coordinates to genomic
-p, --processes INTEGER [default: 1]
--help Show this message and exit.
$ nanopolish eventalign -t 28 \
--scale-events --signal-index --print-read-names \
-r reads.fastq \
-b aligned.bam \
-g transcriptome.fasta |
diffmod prep -p 4 \
-g annotation.gtf \
-h output.hdf5
A HDF5 file storing summarised event level signal data. The file contains one group per gene (if a GTF is provided) or transcript (if no GTF is provided).
yanocomp gmmtest
compares samples from two conditions to identify positions with differences in the profile of aligned nanopore signal "events". Positions where there are detectable differences in signal (detected with Kolmogorov Smirnov test, default p<0.05 and KS statistic > 0.1) will progress to model fitting. This is done by fitting a two-gaussian general mixture model (plus a third uniform distribution for outliers) using pomegranate
and then estimating the fraction of modified and unmodified reads in each sample. These estimates are used to perform a G-test. Fitted models can be optionally used to make single-molecule modification predictions.
$ yanocomp gmmtest --help
Usage: yanocomp gmmtest -c <cntrl_hdf5_1> \
-c <cntrl_hdf5_2> \
-t <treat_hdf5_1> \
-t <treat_hdf5_2> \
[OPTIONS]
Differential RNA modifications using nanopore DRS signal level data
Options:
-c, --cntrl-hdf5-fns TEXT Control HDF5 files. Can specify multiple
files using multiple -c flags [required]
-t, --treat-hdf5-fns TEXT Treatment HDF5 files. Can specify multiple
files using multiple -t flags [required]
-o, --output-bed-fn TEXT Output bed file name [required]
-s, --output-sm-preds-fn TEXT JSON file to output single molecule
predictions. Can be gzipped (detected from
name)
-m, --model-fn TEXT Model file with expected kmer current
distributions
--test-level [gene|transcript] Test at transcript level or aggregate to
gene level [default: gene]
-w, --window-size INTEGER How many adjacent kmers to model over
[default: 5]
-e, --outlier-factor FLOAT Scaling factor for labelling outliers during
model initialisation. Smaller means more
aggressive labelling of outliers [default:
0.5]
-n, --min-read-depth INTEGER Minimum reads per replicate to test a
position. Default is to set dynamically
-d, --max-fit-depth INTEGER Maximum number of reads per replicate used
to fit the model [default: 1000]
-k, --min-ks FLOAT Minimum KS test statistic to attempt to
build a model for a position [default: 0.2]
-f, --fdr-threshold FLOAT False discovery rate threshold for output
[default: 0.05]
-p, --processes INTEGER RANGE [default: 1;x>=1]
--help Show this message and exit.
$ yanocomp gmmtest -p 12 \
-c cntrl_1.hdf5 -c cntrl_2.hdf5 -c cntrl_3.hdf5 \
-t treat_1.hdf5 -t treat_1.hdf5 -t treat_3.hdf5 \
-o output.bed -s output_sm_preds.json.gz
A 19-column BED file format with the following values:
1. chrom [string]
2. start [integer]
3. end [integer]
4. gene_id:kmer [string]
5. score (min(- log10(FDR), 100)) [integer]
6. strand [string, either '+' or '-']
7. modification log ratio (plus 95% conf intervals) [str]
8. G-test p-value [float]
9. G-test FDR [float]
10. Control sample estimated fraction modified [float]
11. Treatment sample estimated fraction modified [float]
12. G-statistic (Treat vs cntrl) [float]
13. Homogeneity G-statistic [float]
14. Unmodified distribution mean for central kmer [float]
15. Unmodified distribution standard deviation [float]
16. Modified distribution mean for central kmer [float]
17. Modified distribution standard deviation [float]
18. KS statistic [float]
gmmtest
can also optionally output a json file containing the single molecule modification predictions. The schema may change but is currently:
{
"input_fns": {
"cntrl": {
"$rep": "<file_name>"
},
"treat": {
"$rep": "<file_name>"
}
}
"single_molecule_predictions": {
"$gene_id": {
"$pos": {
"kmers": [...], # array of strings, shape (window_size,)
"model": {
"unmod": {
"mu": [...], # array of floats, shape (window_size,)
"cov": [...], # array of floats, shape (window_size, window_size)
},
"mod": {
"mu": [...],
"cov": [...],
},
weights: [...] # array of floats, shape (2,)
},
"cntrl": {
"$rep": {
"read_ids": [...], # array of strings, shape (nreads,)
"events": [...], # array of floats, shape (nreads, window_size)
"preds": [...], # array of floats, shape (nreads,)
}
},
"treat": {
"$rep": {
"read_ids": [...],
"events": [...],
"preds": [...],
}
}
}
}
}
}
A preprint describing yanocomp
is available on bioRxiv
Parker,M.T. et al. (2021)
Yanocomp: robust prediction of m6A modifications in individual nanopore direct RNA reads.
bioRxiv, 2021.06.15.448494.