btbpanda/CAFA5-protein-function-prediction-2nd-place

Hello!

Here are the instructions to reproduce the CAFA5 2nd place solution using the code in this repository.

CONTENTS

  • nn_solution : scripts for training the neural network base models
  • protlib : utilities and code to train the Py-Boost and LogReg models, plus data preprocessing and efficient metric computation
  • protnn : utilities and code to train the GCN stacker model
  • CAFA5PIpeline.ipynb : notebook containing all the script calls and a detailed explanation of each step. It also documents the directory structure (it should be considered as both directory_structure.txt and entry_points.md)
  • Download.ipynb : since the produced artifacts are quite large, we store them in cloud storage instead of uploading them to Kaggle. To download all trained models, please execute this notebook. An explanation of the contents is also provided. Note: the artifacts will be stored for 6 months only. After that, you will need to compute them yourself.
  • config.yaml : config used for training and inference
  • create-pytorch-env.sh : installs all the requirements to run the deep learning parts
  • create-rapids-env.sh : installs all the requirements to run the processing and ML steps
  • CAFA5docs.pdf : detailed solution description

HARDWARE

We used the following setup to train:

  • 24 CPUs
  • 512 GB RAM
  • 2 x Tesla V100 32 GB

Minimal required hardware:

  • 8 CPUs
  • 64 GB RAM
  • 1 x Tesla V100 32 GB
  • 300 GB disk space

SOFTWARE

  • Ubuntu 18.04
  • Nvidia driver version 450
  • python>=3.8 to run the CAFA5PIpeline.ipynb and Download.ipynb notebooks. This Python is not used to train the models; it only launches the execution notebooks. The only required libraries are pyyaml, to read config.yaml, and kaggle, to obtain the original dataset via the API
  • conda>=23.5.2. We need a recent version to use the Mamba solver; otherwise, setting up the environments will take hours (see the snippet after this list)
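
For conda versions where libmamba is not yet the default solver (it became the default in conda 23.10), it can be enabled manually. A minimal sketch, assuming a standard conda installation:

conda --version                              # should report >= 23.5.2
conda install -n base conda-libmamba-solver  # install the fast solver into the base env
conda config --set solver libmamba           # use it for all environment operations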

Other required tools are installed via the create-pytorch-env.sh and create-rapids-env.sh scripts.

  • pytorch-env is the environment for training the DL models. It installs pytorch, cupy, and some extra bio libraries
  • rapids-env is the environment for preprocessing and training the ML models. It uses the NVIDIA RAPIDS toolkit (cudf) and cupy for efficient data processing, metric computation (including custom CUDA kernels for graph manipulation), and custom ML algorithm implementations. A sketch of a typical creation command is shown after this list
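
For reference, a RAPIDS environment is typically created along these lines. This is a hypothetical sketch only: the channel list follows the standard RAPIDS install instructions, the versions are placeholders, and the actual create-rapids-env.sh in the repository is authoritative.

# hypothetical sketch -- see create-rapids-env.sh for the real commands and versions
conda create -n rapids-env -c rapidsai -c conda-forge -c nvidia \
    python=3.10 cudf cupy
conda activate rapids-env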

DATA AND ENV SETUP

To install the default Python dependencies, please execute pip install -r requirements.txt

To obtain the original Kaggle dataset, please execute the following (make sure you have obtained a personal Kaggle API token first):

kaggle competitions download -c cafa-5-protein-function-prediction
unzip cafa-5-protein-function-prediction.zip
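
The kaggle CLI expects the API token at ~/.kaggle/kaggle.json. If you have not set it up yet, download the token from your Kaggle account settings page and place it there:

mkdir -p ~/.kaggle
mv /path/to/downloaded/kaggle.json ~/.kaggle/  # adjust the source path as needed
chmod 600 ~/.kaggle/kaggle.json                # the CLI warns about world-readable tokens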

NEXT STEPS

To reproduce the solution, please execute the notebook CAFA5PIpeline.ipynb step by step. Explanations are provided so you can understand what happens at each step.

You can skip some of the long-running cells of the CAFA5PIpeline.ipynb notebook by downloading their results using the Download.ipynb notebook. The numbering of the steps matches exactly in both notebooks.
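
If you prefer to run the notebooks non-interactively, they can also be executed from the command line. This is a sketch assuming jupyter and nbconvert are installed in the launching Python environment; executing the cells interactively remains the intended workflow.

jupyter nbconvert --to notebook --execute --inplace Download.ipynb      # fetch precomputed artifacts
jupyter nbconvert --to notebook --execute --inplace CAFA5PIpeline.ipynb # run the full pipeline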
