bartongroup/FRAGSYS

FRAGSYS

This repository contains the fragment screening analysis pipeline (FRAGSYS) used in our manuscript Classification of likely functional class for ligand binding sites identified from fragment screening.

Our pipeline for the analysis of binding sites, FRAGSYS, can be executed from the Jupyter notebook running_fragsys.ipynb. The input for this pipeline is a table containing a series of PDB codes and their respective UniProt accession identifiers.

Installation

For complete installation instructions, refer to the INSTALL file.

Pipeline methodology

To run FRAGSYS, use the Jupyter notebook running_fragsys.ipynb. You can run the pipeline interactively in a notebook with the command main(main_dir, prot, panddas), using the appropriate environment, varalign_env.

Here, main_dir is the directory where the output will be saved, prot is the query protein, and panddas is a pandas dataframe that must contain at least two columns, entry_uniprot_accession and pdb_id, for all protein structures in the data set.
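As a minimal sketch of the expected input, the snippet below builds a small table by hand and checks for the two required columns before it would be passed to main. The helper check_panddas, the placeholder PDB codes, the output directory, and the choice to give prot as a UniProt accession are all illustrative assumptions, not part of FRAGSYS. The call to main is left commented out and assumes main is available as in running_fragsys.ipynb.

```python
import pandas as pd

# The two columns the README says the panddas table must contain.
REQUIRED_COLUMNS = ("entry_uniprot_accession", "pdb_id")


def check_panddas(panddas: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fail early if a required column is missing."""
    missing = [col for col in REQUIRED_COLUMNS if col not in panddas.columns]
    if missing:
        raise ValueError(f"panddas is missing columns: {missing}")
    return panddas


# Placeholder rows: the PDB codes below are made up for illustration only.
panddas = check_panddas(pd.DataFrame({
    "entry_uniprot_accession": ["P0DTD1", "P0DTD1"],
    "pdb_id": ["1abc", "2xyz"],
}))

main_dir = "./fragsys_output"  # assumed output directory
prot = "P0DTD1"                # query protein, given here as a UniProt accession (assumption)

# With varalign_env active and main available as in running_fragsys.ipynb:
# main(main_dir, prot, panddas)
```

Checking the columns up front is only a convenience: it surfaces a malformed input table before any structures are downloaded.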

For another example, check this other notebook where we ran FRAGSYS for the main protease (MPro) of SARS-CoV-2 (P0DTD1).

For each structural segment of each protein in panddas, FRAGSYS will:

  1. Download biological assemblies from PDBe
  2. Structurally superimpose structures using STAMP
  3. Get accessibility and secondary structure elements from DSSP via ProIntVar
  4. Map PDB residues to UniProt using SIFTS
  5. Obtain protein-ligand interactions by running Arpeggio
  6. Cluster ligands into binding sites using OC
  7. Generate visualisation scripts for UCSF Chimera
  8. Generate multiple sequence alignment (MSA) with jackhmmer
  9. Calculate Shenkin divergence score [1]
  10. Calculate missense enrichment scores with VarAlign
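To make step 9 concrete, here is a minimal sketch of the Shenkin divergence score for a single alignment column, using the commonly stated form V = 6 × 2^S, where S is the Shannon entropy (in bits) of the residue frequencies in the column. The function name shenkin_score and the decision to ignore gap characters are assumptions made for this illustration; the pipeline's own implementation may treat gaps or normalisation differently.

```python
import math
from collections import Counter


def shenkin_score(column: str, gap_chars: str = "-.") -> float:
    """Shenkin divergence of one MSA column: 6 * 2**S, S = Shannon entropy in bits.

    Gap characters are ignored (an assumption of this sketch).
    """
    residues = [aa for aa in column.upper() if aa not in gap_chars]
    if not residues:
        raise ValueError("column contains only gaps")
    total = len(residues)
    counts = Counter(residues)
    # Shannon entropy over the observed residue frequencies.
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 6 * 2 ** entropy


conserved = shenkin_score("AAAA")  # one residue type: entropy 0, score 6
mixed = shenkin_score("AACC")      # two equally frequent types: entropy 1, score 12
```

Under this definition the score ranges from 6 for a fully conserved column to 120 for a column where all 20 amino acids are equally frequent.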

The final output of the pipeline consists of multiple tables for each structural segment collating the results from the different steps of the analysis for each residue, and for the defined ligand binding sites. These data include relative solvent accessibility (RSA), angles, secondary structure, PDB/UniProt residue number, alignment column, column occupancy, divergence score, missense enrichment score, p-value, etc.

These tables are concatenated into master tables, with data for all 37 structural segments, which form the input for the analyses carried out in the analysis notebooks.
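The concatenation into a master table can be sketched as follows, assuming each structural segment yields one per-residue pandas dataframe. The function build_master_table, the segment identifiers, and the column names are hypothetical and chosen only for illustration; they do not reflect the actual table layout produced by FRAGSYS.

```python
import pandas as pd


def build_master_table(segment_tables: dict) -> pd.DataFrame:
    """Stack per-segment tables, tagging each row with its segment of origin."""
    frames = []
    for segment_id, table in segment_tables.items():
        # Add a column so rows remain traceable after concatenation.
        frames.append(table.assign(segment=segment_id))
    return pd.concat(frames, ignore_index=True)


# Toy per-segment tables with made-up column names and values.
segment_tables = {
    "segment_A": pd.DataFrame({"uniprot_resnum": [10, 11], "divergence": [6.0, 12.0]}),
    "segment_B": pd.DataFrame({"uniprot_resnum": [5], "divergence": [30.0]}),
}

master = build_master_table(segment_tables)
```

Keeping a segment column in the master table lets later analyses group or filter residues by the structural segment they came from.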

Refer to notebook 15 to predict RSA cluster labels for your binding sites of interest.

Dependencies

The pipeline, as well as the whole of the analysis, is run interactively in a series of Jupyter notebooks, found in the analysis folder.

Third party dependencies for these notebooks include:

Other standard python libraries:

For more information on the dependencies, refer to the .yml files in the envs directory. To install all the dependencies, refer to the installation manual.

Files

Apart from the INSTALL, LICENSE and README files, there are 5 other files in this repository's main directory: two Python libraries, a configuration file and two notebooks.

Directories

There are 6 directories in this repository.

This environment contains clean_pdb.py, a Python script obtained from here. This script is used to pre-process the PDB files before running Arpeggio on them.

The envs folder contains three .yml files describing the necessary packages and dependencies for the different parts of the pipeline and analysis.

  • arpeggio_env contains Arpeggio.
  • deep_learning_env contains the packages necessary for the machine learning in notebooks 11 and 12.
  • main_env supports all analysis notebooks except notebooks 11 and 12, in which the machine learning models are executed.
  • varalign_env is needed to run FRAGSYS.

The input folder contains the main input file, which is used to run FRAGSYS from the running_fragsys notebook.

The analysis folder contains all the notebooks used to carry out the analysis of the 37 fragment screening experiments. main_env is needed to run these notebooks.

The results folder contains all the results files generated by the notebooks in the analysis folder.

The figs folder contains the main figures generated and saved by the analysis notebooks.

Citation

If you use FRAGSYS, please cite:

Utgés, J.S. et al. Classification of likely functional class for ligand binding sites identified from fragment screening. Commun Biol 7, 320 (2024). https://doi.org/10.1038/s42003-024-05970-8

References

  1. Shenkin PS, Erman B, Mastrandrea LD. Information-theoretical entropy as a measure of sequence variability. Proteins. 1991; 11(4):297–313. Epub 1991/01/01. https://doi.org/10.1002/prot.340110408 PMID: 1758884.