This repository contains the official implementation of 💡 Full-Atom Peptide Design based on Multi-modal Flow Matching (ICML 2024).
You can find our paper here. We also appreciate the inspiration from diffab and frameflow.
If you have any questions, please contact lijiahanypc@pku.edu.cn or ced3ljhypc@gmail.com. Thank you! :)
Please replace cuda and torch version to match your machine, here we test our code on CUDA >= 11.7, we also suggest using micromamba as a replace of conda.
conda env create -f environment.yml # or use micromamba instead of conda
conda activate flow
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.0.0+cu117.html
pip install joblib lmdb easydict
git clone https://github.com/Ced3-han/PepFlowww.git
We suggest adding the code to the Python environment variable, or you can use setup tools.
export PYTHONPATH=$(pwd):$PYTHONPATH
python setup.py develop
We provide data and pretrained model weights here.
- PepMerge_release.zip: 1.2GB
- PepMerge_lmdb.zip: 180MB
- model1.pt: 80MB
- model2.pt: 80MB
The PepMerge_release.zip
contains filtered data of peptide-receptor pairs. For example, in the folder 1a0n_A
, the P
chain in the PDB file 1a0n
is the peptide. In this folder, we provide the FASTA and PDB files of the peptide and receptor. The postfix _merge means the peptide and receptor are in the same PDB file. We also extract the binding pocket of the receptor, where our model is trained to generate peptides based on the binding pocket. You can also download PepBDB and QBioLip, and use playgrounds/gen_dataset
.ipynb to reproduce the dataset.
The PepMerge_lmdb.zip
contains several different splits of the dataset. We use mmseqs2
to cluster complexes based on receptor sequence identity. See playgrounds/cluster.ipynb
for details. The names.txt file contains the names of complexes in the test set. You can use models_con/pep_dataloader.py
to load these datasets. We suggest putting these LMDBs in a single Data
folder.
Besides, model1.pt
and model2.pt
are two checkpoints that you can load using models_con/flow_model.py
together with the config file configs/learn_angle.yaml. We suggest using model1 for benchmark evaluation and model2 for real-world peptide design tasks, the latter is trained on a larger dataset.
We will add more user-friendly straightforward pipelines (generation and evaluation) later.
By default, we support sampling of generated peptides from our processed dataset. You can use models_con/sample.py
to sample, and models_con/inference.py
to reconstruct PDB files.
If you want to use your own data, you can organize your data (peptide and pocket) as we did in PepMerge_release and construct a dataset for sampling and reconstruction. You can also use models_con/pep_dataloader/preprocess_structure
to parse a single data point.
Our evaluation involves many third-party packages, and we include some useful evaluation scripts in eval
. Please refer to our paper for details and download the corresponding packages for evaluation. Please use different python environments for these tools.
You can also train.py
on single GPU training and train_ddp.py
for multiple GPT training.
Future improvements on peptide generation models may include chemical modifications, non-canonical amino acids, pretraining on larger datasets, language models, better sampling methods, etc. Stay tuned and feel free to contact us for collaboration and discussion!
@InProceedings{pmlr-v235-li24o,
title={Full-Atom Peptide Design based on Multi-modal Flow Matching},
author={Li, Jiahan and Cheng, Chaoran and Wu, Zuofan and Guo, Ruihan and Luo, Shitong and Ren, Zhizhou and Peng, Jian and Ma, Jianzhu},
booktitle={Proceedings of the 41st International Conference on Machine Learning},
pages={27615--27640},
year={2024},
editor={Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
volume={235},
series={Proceedings of Machine Learning Research},
month=21--27 Jul},
publisher={PMLR},
}