Data and code associated with kinase activation state and cell viability modeling. Most of the code consists of Rmarkdown documents, with the model testing code saved as R script files. The code requires several packages and is organized into sequential steps:
I've written a script that checks for and installs all of the packages required in the repository. It uses the pacman package for this purpose, and it installs pacman itself if it is missing. There are also two GitHub-based packages:
- BerginskiRmisc for my custom theme and helper scripts
  - The helper script I use calls the "convert" command from ImageMagick to trim whitespace around figures, but this functionality isn't critical for the rest of the results
- DarkKinaseTools for kinase lists
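The check-then-install idea described above can be sketched in base R as follows. This is only an illustration, not the repository's install script: the package vector is a placeholder and `missing_packages` is a hypothetical helper name.

```r
# Placeholder package list; the real list lives in the repository's install script.
cran_packages <- c("stats", "utils")

# Return the subset of `pkgs` that cannot be loaded in this R installation.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(cran_packages)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}

# With pacman, checking, installing and loading collapse into one call:
# if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
# pacman::p_load(char = cran_packages)
```

pacman also provides `p_load_gh()` for GitHub-hosted packages, which takes `"owner/repo"` strings; the owner names for the two packages above are not given here, so they are left out of the sketch.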
There are two primary data sets in the repository: the results of a screening assay and the data downloaded from the supplement of Klaeger et al. The first scripts take each of these data sets and produce R data files that are then used in downstream processing.
- Screen Data: here
  - Reads in and organizes the screen data
- Klaeger Data: here
  - Organizes the Klaeger data for downstream processing
Most of the compound names in the screen/Klaeger collections don't match up exactly, so we had to go through and manually match most of the shared compounds. This report has the code used to simplify this search:
- Report here
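One way to narrow down a manual name search like this is to normalize both name lists and rank candidates by edit distance. The sketch below uses only base R (`utils::adist`) and is an assumption about how such a search could be simplified, not a copy of the report's code. The function names are hypothetical and the compound names are illustrative inputs rather than rows from either data set.

```r
# Normalize a compound name: lower-case, then drop everything except letters and digits.
normalize_name <- function(x) gsub("[^a-z0-9]", "", tolower(x))

# For each screen name, suggest the closest Klaeger name by edit distance.
# The suggestions still need a manual check; a small distance is only a hint.
suggest_matches <- function(screen_names, klaeger_names) {
  d <- utils::adist(normalize_name(screen_names), normalize_name(klaeger_names))
  best <- apply(d, 1, which.min)
  data.frame(
    screen = screen_names,
    klaeger_candidate = klaeger_names[best],
    distance = d[cbind(seq_along(best), best)],
    stringsAsFactors = FALSE
  )
}

# Illustrative inputs only
screen <- c("AZD-1775", "Dasatinib (BMS-354825)", "PF 3758309")
klaeger <- c("Dasatinib", "PF-3758309", "AZD1775")
matches <- suggest_matches(screen, klaeger)
print(matches)
```

Sorting the result by `distance` puts the exact matches first and leaves the ambiguous names at the bottom for manual review.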
With the compound/drug names matched, I prepared the data sets for machine learning (both regression and classification of viability as above or below 90). This code also removes any genes that don't vary in the Klaeger collection after the compounds have been filtered to only the matching compounds. The primary outputs here are machine-learning-ready data sets (deposited in the results) and cross-validation split data sets for both regression and binary classification.
- Report here
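The three preparation steps (dropping non-varying columns, building the binary target, and assigning cross-validation folds) can be sketched in base R on a toy table. Everything here is made up for illustration: the column names, the values, the number of folds, and the choice to count exactly 90 as "above" are all assumptions, and the actual report may use different tools.

```r
set.seed(1)  # reproducible fold assignment

# Toy stand-in for the matched data: one row per treatment, a viability column,
# and one column per gene. All values are invented.
toy <- data.frame(
  viability = c(95, 40, 88, 99, 91, 60, 93, 75),
  kinase_A  = c(1.0, 0.2, 0.8, 1.0, 0.9, 0.3, 1.0, 0.5),
  kinase_B  = rep(1, 8),  # never varies, so it carries no signal
  kinase_C  = c(0.1, 0.9, 0.4, 0.2, 0.3, 0.8, 0.1, 0.6)
)

# 1. Drop predictor columns that take a single value across all rows.
predictors <- setdiff(names(toy), "viability")
varies <- vapply(toy[predictors], function(col) length(unique(col)) > 1, logical(1))
ml_ready <- toy[c("viability", predictors[varies])]

# 2. Binary target for classification (assumption: 90 itself counts as above).
ml_ready$viability_90 <- factor(ifelse(ml_ready$viability >= 90, "above_90", "below_90"))

# 3. Assign each row to one of k cross-validation folds of equal size.
k <- 4
ml_ready$fold <- sample(rep(seq_len(k), length.out = nrow(ml_ready)))
```

The regression models would use `viability` directly and the classification models `viability_90`; keeping both in one table means the same fold assignment serves both tasks.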
The model testing code is saved as single self-contained scripts, each of which fully implements and runs the hyperparameter scanning and model testing. The code is organized this way to make it simpler to run the modeling code on the UNC computing cluster, but it should also be compatible with any computing environment. This code takes a long time to run. All of the regression testing models are available here and the binary above/below 90 models are available here. There are also scripts (search for run_all_models.R) that build out directory infrastructure and run all the models sequentially.
The code that uses the random forest model to make predictions for the rest of the Klaeger compounds is here.
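As a rough picture of what "build out directory infrastructure and run all the models sequentially" can look like, here is a minimal base R sketch. It is not the contents of run_all_models.R; the function name `run_scripts_sequentially` and both directory arguments are hypothetical.

```r
# Create the results directory, then run every .R script in `script_dir` in
# alphabetical order. Both paths are supplied by the caller.
run_scripts_sequentially <- function(script_dir, results_dir) {
  dir.create(results_dir, recursive = TRUE, showWarnings = FALSE)
  scripts <- sort(list.files(script_dir, pattern = "\\.R$", full.names = TRUE))
  ok <- vapply(scripts, function(script) {
    message("Running ", basename(script))
    # Each script is evaluated in a fresh environment so that scripts do not
    # share variables; a failure is recorded instead of aborting the whole run.
    tryCatch({
      source(script, local = new.env())
      TRUE
    }, error = function(e) {
      message("  failed: ", conditionMessage(e))
      FALSE
    })
  }, logical(1))
  data.frame(script = basename(scripts), succeeded = unname(ok),
             stringsAsFactors = FALSE)
}

# Example call with hypothetical paths:
# run_scripts_sequentially("model_scripts", file.path("results", "regression"))
```

Catching errors per script matters for long runs: one failing model does not discard hours of work from the others, and the returned table shows which scripts need to be rerun.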
I've attempted to write a single script that runs all the Rmarkdown files and scripts to completely reproduce the models and figures from the paper. I've only tested the code on Linux, but I see no reason why it wouldn't work on other platforms. Let me know if you attempt to run this script and it fails on your platform.
This script takes a long time to run (7 hours on a Ryzen 7 5800X). This is mostly due to the hyperparameter scanning for the regression and binary models. RAM usage also climbs fairly high during parameter scanning, so you should probably have 64 GB of RAM.
The following information was supplied regarding data availability: The code is available at GitHub and Zenodo:
- https://github.com/gomezlab/PDACperturbations
- https://doi.org/10.5281/zenodo.11623371
- Berginski and Jenner (2024). Kinome state is predictive of cell viability in pancreatic cancer tumor and cancer-associated fibroblast cell lines.