We present DiMotif, an alignment-free discriminative motif discovery method, and evaluate it for finding protein motifs in three settings: (1) a comparison of DiMotif with two existing approaches on 20 distinct, experimentally verified motif discovery problems, (2) a classification-based evaluation of the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for nuclear localization signals. In general, DiMotif obtained high recall while achieving F1 scores comparable to those of other methods in the discovery of experimentally verified motifs. The high recall suggests that DiMotif can be used to create short lists of candidate motifs for further experimental investigation.
@article {Asgari345843,
author = {Asgari, Ehsaneddin and McHardy, Alice and Mofrad, Mohammad R. K.},
title = {Probabilistic variable-length segmentation of protein sequences for discriminative motif mining (DiMotif) and sequence embedding (ProtVecX)},
year = {2018},
doi = {10.1101/345843},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2018/07/12/345843},
eprint = {https://www.biorxiv.org/content/early/2018/07/12/345843.full.pdf},
journal = {bioRxiv}
}
An IPython notebook containing an example of motif discovery with DiMotif is provided here: https://github.com/ehsanasgari/dimotif/blob/master/notebook/DiMotif_step_by_step_example.ipynb
python3 dimotif.py --pos seqfile_of_positive_class --neg seqfile_of_negative_class --outdir output_directory --topn top_N_motifs --segs number_of_segmentations
The above command runs all steps sequentially and organizes the results in the output directory.
--pos sequence file of the positive class, in txt or fasta format
--neg sequence file of the negative class, in txt or fasta format
--outdir output directory
--topn number of motifs to extract
--segs number of segmentation schemes to be sampled
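The options above can also be assembled from a script, for example when running DiMotif over several datasets. The following is a minimal sketch, assuming `dimotif.py` is in the current working directory; the helper name `build_dimotif_command`, the file names, and the values 100 and 10 are hypothetical placeholders, and only the five flags come from the usage line above.

```python
import shlex


def build_dimotif_command(pos, neg, outdir, topn, segs, script="dimotif.py"):
    """Assemble the argument list for the documented dimotif.py invocation.

    Hypothetical helper: only the flag names are taken from the README.
    """
    return [
        "python3", script,
        "--pos", str(pos),          # sequence file of the positive class
        "--neg", str(neg),          # sequence file of the negative class
        "--outdir", str(outdir),    # output directory
        "--topn", str(int(topn)),   # number of motifs to extract
        "--segs", str(int(segs)),   # number of segmentation schemes to sample
    ]


# Placeholder inputs; replace with real paths and values.
cmd = build_dimotif_command("positive.fasta", "negative.fasta", "results", 100, 10)
print(shlex.join(cmd))
```

To execute the command rather than print it, the list can be passed to `subprocess.run(cmd, check=True)` from the repository root.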
- For a given set of positive sequences, it extracts the most discriminative motifs in the positive class using a probabilistic segmentation inferred from Swiss-Prot.
- Motifs are hierarchically clustered according to their co-occurrence patterns in the positive sequences.
- Motifs are colored according to their most frequent secondary structure in the PDB database.
- For each motif, the normalized biophysical scores are also provided for further biophysical interpretation.
- The orange databases in the diagram are general-purpose databases and information, whereas the red and blue databases are the problem-specific datasets for which we want to find related motifs.
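To make the idea of a "discriminative" motif concrete, the sketch below ranks candidate motifs by how strongly their presence separates positive from negative sequences. It is an illustration under stated assumptions, not DiMotif's implementation: the candidates are supplied by hand instead of coming from the Swiss-Prot-based probabilistic segmentation, the chi-square statistic is used here as one common scoring choice (this README does not state which statistic DiMotif uses), and the toy sequences and the function names `chi2_2x2` and `rank_motifs` are made up for the example.

```python
def chi2_2x2(a, b, c, d):
    """Chi-square statistic of the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def rank_motifs(pos_seqs, neg_seqs, candidates, topn):
    """Return up to topn (motif, score) pairs enriched in the positive class."""
    scored = []
    for motif in candidates:
        a = sum(motif in s for s in pos_seqs)  # positives containing the motif
        c = sum(motif in s for s in neg_seqs)  # negatives containing the motif
        b = len(pos_seqs) - a                  # positives without the motif
        d = len(neg_seqs) - c                  # negatives without the motif
        # Keep only motifs that are more frequent in the positive class.
        if a * len(neg_seqs) > c * len(pos_seqs):
            scored.append((chi2_2x2(a, b, c, d), motif))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(motif, score) for score, motif in scored[:topn]]


# Toy data (invented): every positive sequence contains "RGD".
positives = ["MKRGDSA", "TTRGDLL", "RGDAAKK"]
negatives = ["MKAAASA", "TTLLLKK", "GGAAKKP"]
top = rank_motifs(positives, negatives, ["RGD", "AA", "KK"], topn=2)
print(top)  # [('RGD', 6.0)]
```

Only "RGD" is returned because "AA" and "KK" occur more often in the toy negative set than in the positive set, so the enrichment filter drops them.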