BenderGroup/PIDGINv4

Prediction IncluDinG INactivity (PIDGIN) Version 4.2

UPDATE MAR 2022: The new no-orthologue models can be downloaded at https://doi.org/10.6084/m9.figshare.19108382.v1 - remove the old no_ortho directory, and download and unzip the folder in the PIDGINv4 root directory. See the ReadtheDocs for full installation instructions.

For now, the orthologue (--ortho) command is deprecated - use the old models at your own risk. If you require the orthologue models for your research, please get in touch!

Authors: Maria-Anna Trapotsi, Layla Hosseini-Gerami and Lewis Mervin

Email: mat64@cam.ac.uk and lh605@cam.ac.uk

Supervisor: Dr. A. Bender

Protein target prediction using Random Forests (RFs) trained on bioactivity data from PubChem (extracted Mar 2020) and ChEMBL (version 26). The models are built with RDKit and Scikit-learn and employ a modification of the reliability-density neighbourhood Applicability Domain (AD) analysis by Aniceto [1]. This project is the successor to PIDGIN version 1 [2] and PIDGIN version 2 [3], and is the updated and retrained version of PIDGIN version 3. Target prediction with extended NCBI pathway and DisGeNET disease enrichment calculation is available as implemented in [4].

  • Molecular Descriptors: 2048-bit RDKit Extended Connectivity FingerPrints (ECFP) [5]
  • Algorithm: Random Forests with dynamic number of trees (see docs for details), class weight = 'balanced', sample weight = ratio Inactive:Active
  • Models generated at four different cut-offs: 100μM, 10μM, 1μM and 0.1μM
  • Models generated both with and without mapping to orthologues, as implemented in [3]
  • Pathway information from NCBI BioSystems
  • Disease information from DisGeNET
  • Target/pathway/disease enrichment calculated using Fisher's exact test and the Chi-squared test
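
As a rough illustration of the enrichment statistics named in the last bullet, the sketch below runs a one-sided Fisher's exact test and an uncorrected Chi-squared test on a single 2x2 contingency table using only the Python standard library. The table layout (query set versus background set, predicted versus not predicted) and the function names are assumptions made for this example; they are not taken from the PIDGIN source, which may build its tables and apply corrections differently.

```python
from math import comb, erfc, sqrt


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, i.e. the chance of
    seeing at least `a` hits in the first row if rows and columns are
    independent.
    """
    row1 = a + b          # size of the query set (assumed layout)
    col1 = a + c          # total number of hits across both sets
    n = a + b + c + d     # total number of compounds
    max_a = min(row1, col1)
    # Sum hypergeometric probabilities over tables at least as extreme.
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, max_a + 1))
    return tail / comb(n, col1)


def chi_squared_2x2(a, b, c, d):
    """Pearson Chi-squared test (no continuity correction), 1 degree of freedom.

    Returns (statistic, p_value).
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the Chi-squared distribution with 1 d.o.f.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: a target is predicted for 8 of 10 query compounds
# and for 2 of 10 background compounds.
p_fisher = fisher_exact_greater(8, 2, 2, 8)
chi2_stat, p_chi2 = chi_squared_2x2(8, 2, 2, 8)
print(f"Fisher p = {p_fisher:.4f}, Chi-squared = {chi2_stat:.2f}, p = {p_chi2:.4f}")
```

Fisher's exact test stays valid for small counts, whereas the Chi-squared approximation is usually reserved for tables with larger expected cell counts.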

Details for sizes across all activity cut-offs

                                                 Without orthologues         With orthologues
Distinct Models                                  11,782                      16,772
Distinct Targets [exhaustive total]              3,698 [11,782]              17,021 [63,140]
Total Bioactivities Over all models              50,210,041                  437,574,005
Actives                                          4,079,996                   4,087,155
Inactives [Of which are Sphere Exclusion (SE)]   46,130,045 [35,119,663]     463,237,781 [314,117,438]

Full details on all models are provided in the uniprot_information.txt files in the orthologue and no_orthologue directories

INSTRUCTIONS

Development occurs on GitHub.

Install with Conda

Documentation, installation and instructions are on ReadtheDocs.

IMPORTANT

  • Use the ReadtheDocs! You MUST download the models before running!
  • The program accepts input either as line-separated SMILES (.smi/.smiles) or as an SDF file (.sdf)
  • If the SMILES input contains data additional to the SMILES string, the first entries after the SMILES are automatically interpreted as identifiers (see the OpenSMILES specification §4.5) - although there are options to change this behaviour
  • Molecules are automatically standardized when running models (can be turned off)
  • Do not modify the 'pkls', 'ad_data' etc. names or directories
  • Files in the examples directory are included for testing, as described in the ReadtheDocs tutorials.
  • For installation and usage instructions, see the documentation.
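
To make the default SMILES input convention above concrete, here is a minimal sketch of reading line-separated SMILES where the first field after the SMILES string is taken as the identifier. The function name, its parameters and the fallback identifier scheme are hypothetical and chosen only for illustration; they do not correspond to PIDGIN's actual options, which are described in the documentation.

```python
def parse_smiles_lines(text, delimiter=None, smiles_col=0, id_col=1):
    """Parse line-separated SMILES into (smiles, identifier) tuples.

    delimiter=None splits on any run of whitespace. When a line has no
    identifier field, a placeholder based on the line number is used
    (an assumption for this sketch, not PIDGIN behaviour).
    """
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(delimiter)
        smiles = fields[smiles_col]
        if len(fields) > id_col:
            identifier = fields[id_col]
        else:
            identifier = f"mol_{line_no}"
        records.append((smiles, identifier))
    return records


example = "CCO ethanol\nc1ccccc1\tbenzene extra\n\nCC(=O)O\n"
for smiles, identifier in parse_smiles_lines(example):
    print(identifier, smiles)
```

Any fields after the identifier are ignored here; a real input file with extra columns may need the column indices adjusted.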

License

PIDGINv4 is available under the GNU General Public License v3.0 (GPLv3).

References

[1] Aniceto, N., et al. A novel applicability domain technique for mapping predictive reliability across the chemical space of a QSAR: Reliability-density neighbourhood. J. Cheminform. 8: 69 (2016).
[2] Mervin, L. H., et al. Target prediction utilising negative bioactivity data covering large chemical space. J. Cheminform. 7: 51 (2015).
[3] Mervin, L. H., et al. Orthologue chemical space and its influence on target prediction. Bioinformatics. 34: 72–79 (2018).
[4] Mervin, L. H., et al. Understanding Cytotoxicity and Cytostaticity in a High-Throughput Screening Collection. ACS Chem. Biol. 11: 11 (2016).
[5] Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50: 742–754 (2010).