This repository contains the fragment screening analysis pipeline (FRAGSYS) used for the analysis in our manuscript Classification of likely functional class for ligand binding sites identified from fragment screening.
Our pipeline for the analysis of binding sites, FRAGSYS, can be executed from the Jupyter notebook running_fragsys.ipynb. The input for this pipeline is a table containing a series of PDB codes and their respective UniProt accession identifiers.
For complete installation instructions refer here.
To run FRAGSYS, refer to the Jupyter notebook running_fragsys.ipynb. You can run the pipeline interactively in a notebook, using the appropriate environment (varalign_env), with this command: main(main_dir, prot, panddas)
Here, main_dir is the directory where the output will be saved, prot is the query protein, and panddas is a pandas dataframe that must contain at least two columns, entry_uniprot_accession and pdb_id, for all protein structures in the data set.
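As a minimal sketch, the input dataframe could be assembled and checked along these lines before calling main. The PDB codes are placeholders, check_panddas is a hypothetical helper, the output path is arbitrary, and both the import path from fragsys_main and the use of a UniProt accession for prot are assumptions; only the two column names and the main(main_dir, prot, panddas) call come from the description above.

```python
import pandas as pd

# Columns the text above says panddas must contain.
REQUIRED_COLUMNS = ("entry_uniprot_accession", "pdb_id")


def check_panddas(panddas):
    """Return the required columns that are missing (hypothetical helper)."""
    return [col for col in REQUIRED_COLUMNS if col not in panddas.columns]


# Placeholder rows: the PDB codes below are NOT real entries.
panddas = pd.DataFrame(
    {
        "entry_uniprot_accession": ["P0DTD1", "P0DTD1"],
        "pdb_id": ["1abc", "2xyz"],
    }
)

missing = check_panddas(panddas)
if missing:
    raise ValueError(f"panddas is missing required columns: {missing}")

main_dir = "./fragsys_output"  # hypothetical output directory
prot = "P0DTD1"                # assumed: query protein given as an accession

# Assumed import path; uncomment inside varalign_env with the repository on the path.
# from fragsys_main import main
# main(main_dir, prot, panddas)
```

Checking the columns up front is a design choice of this sketch: it surfaces a malformed input table before any structures are downloaded.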
For another example, check this other notebook, where we ran FRAGSYS for the main protease (MPro) of SARS-CoV-2 (P0DTD1).
For each structural segment of each protein in panddas, FRAGSYS will:
- Download biological assemblies from PDBe
- Structurally superimpose structures using STAMP
- Get accessibility and secondary structure elements from DSSP via ProIntVar
- Map PDB residues to UniProt using SIFTS
- Obtain protein-ligand interactions by running Arpeggio
- Cluster ligands into binding sites using OC
- Generate visualisation scripts for UCSF Chimera
- Generate a multiple sequence alignment (MSA) with jackhmmer
- Calculate Shenkin divergence score [1]
- Calculate missense enrichment scores with VarAlign
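To illustrate the divergence step in the list above, here is a small sketch of the Shenkin score for a single alignment column, assuming the commonly quoted form 6 × 2^H, where H is the Shannon entropy of the column in bits [1]. The function name shenkin_score and the choice to ignore gap characters are assumptions made for this example; the pipeline's own implementation, gap handling and any normalisation may differ.

```python
import math
from collections import Counter


def shenkin_score(column, gap_chars="-."):
    """Shenkin divergence of one alignment column: 6 * 2**H, with H in bits.

    Gap characters are ignored here, which is an assumption of this sketch.
    """
    residues = [aa for aa in column.upper() if aa not in gap_chars]
    if not residues:
        raise ValueError("column contains only gaps")
    total = len(residues)
    counts = Counter(residues)
    # Shannon entropy of the residue frequencies in the column.
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 6 * 2 ** entropy


conserved = shenkin_score("AAAA")     # one residue type only
two_types = shenkin_score("AACC")     # two residue types, equally frequent
with_gaps = shenkin_score("AA--CC")   # gaps do not contribute
```

Under this form, a fully conserved column scores 6, and the score grows with the effective number of residue types observed in the column.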
The final output of the pipeline consists of multiple tables for each structural segment collating the results from the different steps of the analysis for each residue, and for the defined ligand binding sites. These data include relative solvent accessibility (RSA), angles, secondary structure, PDB/UniProt residue number, alignment column, column occupancy, divergence score, missense enrichment score, p-value, etc.
These tables are concatenated into master tables, with data for all 37 structural segments, which form the input for the analyses carried out in the analysis notebooks.
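The concatenation of per-segment tables into a master table can be sketched with pandas as follows. The function build_master_table, the segment labels, the column names and all values are hypothetical and purely illustrative; they are not taken from the actual FRAGSYS output.

```python
import pandas as pd


def build_master_table(segment_tables):
    """Stack per-segment tables, tagging each row with its segment label."""
    frames = [
        table.assign(segment=label) for label, table in segment_tables.items()
    ]
    return pd.concat(frames, ignore_index=True)


# Illustrative per-residue tables for two made-up structural segments.
segment_a = pd.DataFrame({"resnum": [10, 11], "rsa": [35.2, 4.1]})
segment_b = pd.DataFrame({"resnum": [7], "rsa": [61.0]})

master = build_master_table({"segment_a": segment_a, "segment_b": segment_b})
print(master)
```

Adding a segment column before concatenating keeps every row traceable to the segment it came from once the tables are merged.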
Refer to notebook 15 to predict RSA cluster labels for your binding sites of interest.
The pipeline, as well as the whole of the analysis, is run interactively in a series of Jupyter notebooks, found in the analysis folder.
Third party dependencies for these notebooks include:
- Arpeggio (GNU GPL v3.0 License)
- DSSP (Boost Software License)
- Hmmer (BSD-3 Clause License)
- OC
- STAMP (GNU GPL v3.0 License)
- ProIntVar (MIT License)
- ProteoFAV (MIT License)
- VarAlign (MIT License)
Other standard Python libraries:
- Biopython (BSD 3-Clause License)
- Keras (Apache v2.0 License)
- Matplotlib (PSF License)
- Numpy (BSD 3-Clause License)
- Pandas (BSD 3-Clause License)
- Scipy (BSD 3-Clause License)
- Seaborn (BSD 3-Clause License)
- Scikit-learn (BSD 3-Clause License)
- Tensorflow (Apache v2.0 License)
For more information on the dependencies, refer to the .yml files in the envs directory. To install all the dependencies, refer to the installation manual.
Apart from the INSTALL, LICENSE and README files, there are 5 other files in the main directory of this repository: a configuration file, two Python libraries and two notebooks.
- fragsys_config.txt contains the default parameters to run FRAGSYS and is read by fragsys.py.
- fragsys.py contains all the functions, lists and dictionaries needed to run the pipeline.
- fragsys_main.py contains the main FRAGSYS function, where all functions in fragsys.py are called. This script represents the pipeline itself.
- running_fragsys.ipynb is the notebook where the pipeline is executed in an interactive way.
- running_fragsys_for_MPRO.ipynb.ipynb is the notebook where the pipeline is executed in an interactive way for a case study of SARS-CoV-2 MPro.
There are 6 directories in this repository.
This environment contains clean_pdb.py, a python script grabbed from here. This script will be used to pre-process the PDB files before running Arpeggio on them.
The envs folder contains the .yml files describing the necessary packages and dependencies for the different parts of the pipeline and analysis:
- arpeggio_env contains Arpeggio.
- deep_learning_env contains the packages necessary for the machine learning in notebooks 11 and 12.
- main_env supports all analysis notebooks except notebooks 11 and 12, in which the machine learning models are executed.
- varalign_env is needed to run FRAGSYS.
The input folder contains the main input file used to run FRAGSYS in the running_fragsys notebook.
The analysis folder contains all the notebooks used to carry out the analysis of the 37 fragment screening experiments. main_env is needed to run these notebooks.
The results folder contains all the results files generated by the notebooks in the analysis folder.
The figs folder contains the main figures generated and saved by the analysis notebooks.
If you use FRAGSYS, please cite:
Utgés, J.S. et al. Classification of likely functional class for ligand binding sites identified from fragment screening. Commun Biol 7, 320 (2024). https://doi.org/10.1038/s42003-024-05970-8
[1] Shenkin PS, Erman B, Mastrandrea LD. Information-theoretical entropy as a measure of sequence variability. Proteins. 1991; 11(4):297–313. Epub 1991/01/01. https://doi.org/10.1002/prot.340110408 PMID: 1758884.