UPDATE MAR 2022: The new no-orthologue models can be downloaded at https://doi.org/10.6084/m9.figshare.19108382.v1 - remove the old no_ortho directory, and download and unzip the folder in the PIDGINv4 root directory. See the ReadtheDocs for full installation instructions.
For now, the orthologue (--ortho) command is deprecated - use the old models at your own risk. If you require the orthologue models for your research, please get in touch!
Authors: Maria-Anna Trapotsi, Layla Hosseini-Gerami and Lewis Mervin
Email: mat64@cam.ac.uk and lh605@cam.ac.uk
Supervisor: Dr. A. Bender
Protein target prediction using Random Forests (RFs) trained on bioactivity data from PubChem (extracted Mar 2020) and ChEMBL (version 26), built with RDKit and scikit-learn. Predictions employ a modification of the reliability-density neighbourhood Applicability Domain (AD) analysis by Aniceto et al. [1]. This project is the successor to PIDGIN version 1 [2] and PIDGIN version 2 [3], and is the updated and retrained version of PIDGIN version 3. Target prediction with extended NCBI pathway and DisGeNET disease enrichment calculation is available as implemented in [4].
- Molecular Descriptors: 2048-bit RDKit Extended Connectivity FingerPrints (ECFP) [5]
- Algorithm: Random Forests with dynamic number of trees (see docs for details), class weight = 'balanced', sample weight = ratio Inactive:Active
- Models generated at four different activity cut-offs: 100μM, 10μM, 1μM and 0.1μM
- Models generated both with and without mapping to orthologues, as implemented in [3]
- Pathway information from NCBI BioSystems
- Disease information from DisGeNET
- Target/pathway/disease enrichment calculated using Fisher's exact test and the Chi-squared test
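
The enrichment tests named in the last bullet can be illustrated on a single 2x2 contingency table. The following is a minimal standard-library sketch, not PIDGIN's own implementation: the function names, the one-sided form of Fisher's exact test, the uncorrected chi-squared statistic, and the example counts are all assumptions made for illustration.

```python
from math import comb, erfc, sqrt


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns the hypergeometric probability of observing `a` or more
    hits in the first row, given fixed row and column totals.
    """
    row1 = a + b          # size of the query set
    col1 = a + c          # total number of hits across both sets
    n = a + b + c + d     # total number of compounds
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, row1)


def chi_squared(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Chi-squared statistic (no continuity correction) and p-value, 1 d.o.f."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column total must be non-zero")
    statistic = n * (a * d - b * c) ** 2 / denominator
    # Survival function of a chi-squared variable with one degree of freedom.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: a target is predicted for 30 of 100 query compounds
# and for 50 of 1000 background compounds.
hits_query, misses_query = 30, 70
hits_background, misses_background = 50, 950

p_fisher = fisher_exact_greater(hits_query, misses_query,
                                hits_background, misses_background)
chi2, p_chi2 = chi_squared(hits_query, misses_query,
                           hits_background, misses_background)
print(f"Fisher p = {p_fisher:.3g}, chi2 = {chi2:.2f}, chi2 p = {p_chi2:.3g}")
```

A small p-value from either test suggests that the target is predicted more often in the query set than the background rate would explain. Fisher's exact test is generally preferred when any cell count is small.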
Model and dataset sizes across all activity cut-offs:
| | Without orthologues | With orthologues |
|---|---|---|
| Distinct Models | 11,782 | 16,772 |
| Distinct Targets [exhaustive total] | 3,698 [11,782] | 17,021 [63,140] |
| Total Bioactivities Over All Models | 50,210,041 | 437,574,005 |
| Actives | 4,079,996 | 4,087,155 |
| Inactives [of which are Sphere Exclusion (SE)] | 46,130,045 [35,119,663] | 463,237,781 [314,117,438] |
Full details on all models are provided in the uniprot_information.txt files in the orthologue and no_orthologue directories
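
As a quick worked check of the figures above, the sketch below derives the aggregate class imbalance for the models without orthologues. The numbers are copied from the table; the variable names are illustrative. The ratios are totals across all models, so the per-model Inactive:Active sample weights will differ.

```python
# Figures copied from the "Without orthologues" column of the table above.
total_bioactivities = 50_210_041
actives = 4_079_996
inactives = 46_130_045
sphere_exclusion_inactives = 35_119_663

# Actives and inactives should account for every bioactivity.
assert actives + inactives == total_bioactivities

# Aggregate Inactive:Active ratio across all models.
inactive_to_active = inactives / actives

# Share of inactives that come from Sphere Exclusion.
se_fraction = sphere_exclusion_inactives / inactives

print(f"Inactive:Active ratio = {inactive_to_active:.2f}")
print(f"Sphere Exclusion share of inactives = {se_fraction:.1%}")
```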
Development occurs on GitHub.
Documentation, installation and instructions are on ReadtheDocs.
- Use the ReadtheDocs! You MUST download the models before running!
- The program accepts input either as line-separated SMILES (.smi/.smiles) or as an SDF file (.sdf)
- If the SMILES input contains data additional to the SMILES string, the first entries after the SMILES are automatically interpreted as identifiers (see the OpenSMILES specification §4.5) - although there are options to change this behaviour
- Molecules are automatically standardized when running models (can be turned off)
- Do not modify the 'pkls', 'ad_data' etc. names or directories
- Files in the examples directory are included for testing as on the ReadtheDocs tutorials.
- For installation and usage instructions, see the documentation.
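
The default SMILES input convention described above (a SMILES string, then an identifier as the next whitespace-separated entry) can be sketched as follows. This is an illustration of the file layout only, not PIDGIN's parser: the function names and the `mol_<line number>` fallback identifier are hypothetical, and the options that change the default behaviour are not modelled.

```python
from typing import Iterable, Iterator, Optional, Tuple


def parse_smiles_line(line: str, line_number: int) -> Optional[Tuple[str, str]]:
    """Split one input line into (SMILES, identifier).

    The SMILES string ends at the first whitespace; the next field, if
    present, is taken as the identifier. Blank lines return None.
    """
    fields = line.split()
    if not fields:
        return None
    smiles = fields[0]
    # Hypothetical fallback: name the molecule after its line number.
    identifier = fields[1] if len(fields) > 1 else f"mol_{line_number}"
    return smiles, identifier


def read_smiles(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (SMILES, identifier) pairs from line-separated input."""
    for line_number, line in enumerate(lines, start=1):
        parsed = parse_smiles_line(line, line_number)
        if parsed is not None:
            yield parsed


example_input = [
    "CCO ethanol",
    "",
    "c1ccccc1",
    "CC(=O)Oc1ccccc1C(=O)O aspirin extra_column",
]
for smiles, identifier in read_smiles(example_input):
    print(identifier, smiles)
```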
PIDGINv4 is available under the GNU General Public License v3.0 (GPLv3).
[1] | Aniceto, N, et al. A novel applicability domain technique for mapping predictive reliability across the chemical space of a QSAR: Reliability-density neighbourhood. J. Cheminform. 8: 69 (2016). |
[2] | Mervin, L H., et al. Target prediction utilising negative bioactivity data covering large chemical space. J. Cheminform. 7: 51 (2015). |
[3] | Mervin, L H., et al. Orthologue chemical space and its influence on target prediction. Bioinformatics. 34: 72–79 (2018). |
[4] | Mervin, L H., et al. Understanding Cytotoxicity and Cytostaticity in a High-Throughput Screening Collection. ACS Chem. Biol. 11: 11 (2016). |
[5] | Rogers D & Hahn M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50: 742-54 (2010). |