btbpanda/CAFA5-protein-function-prediction-2nd-place

Hello!

Here are the instructions to reproduce the CAFA5 2nd place solution using the code in this repository.

CONTENTS

  • nn_solution : scripts for training the neural network base models
  • protlib : utilities and code to train the Py-Boost and LogReg models, plus data preprocessing and efficient metric computation
  • protnn : utilities and code to train the GCN stacker model
  • CAFA5PIpeline.ipynb : notebook containing all the script calls and a detailed explanation of each step. It also documents the directory structure (it should be considered as both directory_structure.txt and entry_points.md)
  • Download.ipynb : since the produced artifacts are quite large, we store them in cloud storage instead of uploading them to Kaggle. To download all trained models, please execute this notebook. An explanation of the contents is also provided. Note: the artifacts will be stored for 6 months only. After that, you will need to compute them yourself.
  • config.yaml : config used for training and inference
  • create-pytorch-env.sh : installs all the requirements to run the deep learning parts
  • create-rapids-env.sh : installs all the requirements to run the processing and ML steps
  • CAFA5docs.pdf : detailed solution description

HARDWARE

We used the following setup to train:

  • 24 CPUs
  • 512 GB RAM
  • 2 x Tesla V100 32 GB

Minimal required hardware:

  • 8 CPUs
  • 64 GB RAM
  • 1 x Tesla V100 32 GB
  • 300 GB disk space

SOFTWARE

  • Ubuntu 18.04
  • Nvidia driver version 450
  • python>=3.8 to run the CAFA5PIpeline.ipynb and Download.ipynb notebooks. This Python is not used to train the models; it only launches the execution notebooks. The only required libraries are pyyaml, to read config.yaml, and kaggle, to obtain the original dataset via the API
  • conda>=23.5.2. We need a recent version to use the Mamba solver; otherwise, setting up the environments will take hours (see the snippet after this list)
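
For conda versions where libmamba is not yet the default solver (it became the default in conda 23.10), it can be enabled manually. A minimal sketch, assuming a standard conda installation:

conda --version                              # should report >= 23.5.2
conda install -n base conda-libmamba-solver  # install the fast solver into the base env
conda config --set solver libmamba           # use it for all environment operations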

Other required tools are installed via the create-pytorch-env.sh and create-rapids-env.sh scripts.

  • pytorch-env is the environment for training the DL models. It installs pytorch, cupy, and some extra bio libraries
  • rapids-env is the environment for preprocessing and training the ML models. It uses the NVIDIA RAPIDS toolkit (cudf) and cupy for efficient data processing, metric computation (including custom CUDA kernels for graph manipulation), and custom ML algorithm implementations. A sketch of a typical creation command is shown after this list
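
For reference, a RAPIDS environment is typically created along these lines. This is a hypothetical sketch only: the channel list follows the standard RAPIDS install instructions, the versions are placeholders, and the actual create-rapids-env.sh in the repository is authoritative.

# hypothetical sketch -- see create-rapids-env.sh for the real commands and versions
conda create -n rapids-env -c rapidsai -c conda-forge -c nvidia \
    python=3.10 cudf cupy
conda activate rapids-env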

DATA AND ENV SETUP

To install the default Python dependencies, please execute pip install -r requirements.txt

To obtain the original Kaggle dataset, please execute the following (make sure you have obtained a personal Kaggle API token first):

kaggle competitions download -c cafa-5-protein-function-prediction
unzip cafa-5-protein-function-prediction.zip
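
The kaggle CLI expects the API token at ~/.kaggle/kaggle.json. If you have not set it up yet, download the token from your Kaggle account settings page and place it there:

mkdir -p ~/.kaggle
mv /path/to/downloaded/kaggle.json ~/.kaggle/  # adjust the source path as needed
chmod 600 ~/.kaggle/kaggle.json                # the CLI warns about world-readable tokens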

NEXT STEPS

To reproduce the solution, please execute the notebook CAFA5PIpeline.ipynb step by step. Explanations are provided so you can understand what happens at each step.

You can skip some of the long-running cells of the CAFA5PIpeline.ipynb notebook by downloading their results using the Download.ipynb notebook. The numbering of the steps matches exactly in both notebooks.
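
If you prefer to run the notebooks non-interactively, they can also be executed from the command line. This is a sketch assuming jupyter and nbconvert are installed in the launching Python environment; executing the cells interactively remains the intended workflow.

jupyter nbconvert --to notebook --execute --inplace Download.ipynb      # fetch precomputed artifacts
jupyter nbconvert --to notebook --execute --inplace CAFA5PIpeline.ipynb # run the full pipeline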
