Protein Structures Archiver (ProteStAr) is a tool designed to compress collections of files describing protein structures. The current version supports PDB, mmCIF, PAE (predicted aligned error, in JSON format), and confidence (in JSON format) files.
The tool offers high compression ratios (more than 4 times better than gzip) and fast random access queries. For example, the whole ESM Atlas v0 database with ~600M protein structures of raw size 67.7 TB (15.5 TB when gzipped) can be stored in 3.78 TB. Moreover, with one of the lossy modes turned on, this drops to 1.59 TB. Compressing the whole dataset took about 30 hours using 16 threads on a workstation equipped with an AMD TR 3995WX CPU.
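The ratios behind these figures can be checked with a few lines of Python. This is only a worked calculation on the sizes quoted above; the helper name `ratio` and the dictionary keys are chosen here for illustration.

```python
# Archive sizes in TB, as quoted above for ESM Atlas v0.
RAW_TB = 67.7
GZIP_TB = 15.5
PROTESTAR_LOSSLESS_TB = 3.78
PROTESTAR_LOSSY_TB = 1.59


def ratio(before_tb: float, after_tb: float) -> float:
    """Return how many times smaller `after_tb` is than `before_tb`."""
    return before_tb / after_tb


ratios = {
    "gzip vs raw": ratio(RAW_TB, GZIP_TB),
    "lossless vs raw": ratio(RAW_TB, PROTESTAR_LOSSLESS_TB),
    "lossless vs gzip": ratio(GZIP_TB, PROTESTAR_LOSSLESS_TB),
    "lossy vs raw": ratio(RAW_TB, PROTESTAR_LOSSY_TB),
    "lossy vs gzip": ratio(GZIP_TB, PROTESTAR_LOSSY_TB),
}

for label, value in ratios.items():
    print(f"{label}: {value:.1f}x")
```

On these numbers the lossless archive is roughly 4.1 times smaller than the gzipped data and roughly 17.9 times smaller than the raw data.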
git clone --recurse-submodules https://github.com/refresh-bio/protestar
cd protestar && make -j
# Compress a collection of 10 PDB files in a single directory --- lossless mode
./bin/protestar add --type pdb --indir apsd-data/pdb/ --out test_pdb.psa
# Compress a collection of 10 CIF files in a single directory --- lossless mode
./bin/protestar add --type cif --indir apsd-data/cif/ --out test_cif.psa
# Compress a collection of 10 PAE files in a single directory --- lossless mode
./bin/protestar add --type pae --indir apsd-data/pae/ --out test_pae.psa
# Compress a collection of all files in a single directory (traversed recursively) --- lossless mode
./bin/protestar add --type ALL --indir-recursive apsd-data/ --out test_all.psa
# Compress a collection of 10 PDB files in a single directory --- lossy and minimal mode
./bin/protestar add --type pdb --indir apsd-data/pdb/ --out test_pdb_10_100.psa --minimal --lossy --max-error-bb 10 --max-error-sc 100
# Extend a collection of previously compressed data by adding 10 PAE files in lossy mode (level 2)
./bin/protestar add --type pae --indir apsd-data/pae/ --in test_pdb_10_100.psa --out test_mixed.psa --lossy --pae-lossy-level 2
# List contents of the archive (all file types)
./bin/protestar list --in test_mixed.psa --type ALL
# Extract a single file from the archive to the current directory
./bin/protestar get --in test_mixed.psa --outdir ./ --type ALL --file AF-A0A075B6I1-F1-model_v4
# Extract all PAE files from the archive
./bin/protestar get --in test_mixed.psa --outdir ./ --type pae --all
# Show some info about the archive
./bin/protestar info --in test_mixed.psa
ProteStAr should be downloaded from https://github.com/refresh-bio/protestar and compiled. The supported operating systems are:
- Windows: Visual Studio 2022 solution provided,
- Linux: make project (G++ 9.0 or newer required).
Support for MacOS as well as ARM-based CPUs will be added soon.
- 1.1.0 (8 May 2024)
  - Support of ANISOU, SIGATM, and SIGUIJ sections in PDB files.
  - Some fixes in the PDB and CIF output formatting.
  - Support of MacOS (M1 and x64 architectures).
- 1.0.0 (8 December 2023)
  - pyprotestar Python package added,
  - fixed incorrect alignment of ATOM column in some PDB files.
- 0.7 (20 July 2023)
  - first public release.
protestar <command> <options>
Command:
- add – add files to archive
- get – get files from archive
- list – list archive contents
- info – show some statistics of the archive
protestar add <options>
Options:
- --type <string> – file type: cif, pdb, pae, conf, ALL
- --in <string> – name of input archive (if you want to extend an existing archive)
- --out <string> – name of output archive
- --infile <string> – file name to add
- --indir <string> – directory with files to add
- --indir-recursive <string> – directory (recursive) with files to add
- --inlist <string> – name of file with paths to files to add
- --intar <string> – name of tar file with files to add
- -t|--threads <int> – no. of threads
- --fast – slightly faster compression but slightly worse ratio (only for CIF|PDB files)
- --minimal – minimal mode (only most important fields from CIF|PDB files are stored)
- --lossy – turn on lossy compression (only for CIF|PDB|PAE files)
- --max-error-bb – max error (in mA [0, 500]) of backbone atom coordinates (only for CIF|PDB files)
- --max-error-sc – max error (in mA [0, 500]) of side-chain atom coordinates (only for CIF|PDB files)
- --pae-lossy-level – lossy level from range [0, 4] (only for PAE files)
- --single-bf – enable single B-factor value (only for CIF|PDB files)
- -v|--verbose <int> – verbosity level
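When `protestar add` is driven from a script, it can help to assemble the argument list programmatically and reject out-of-range values before starting a long compression run. The sketch below is not part of ProteStAr: `build_add_command` and its parameter names are hypothetical, it uses only the flags listed above, and it assumes (as in the usage examples) that `--lossy` accompanies any of the lossy parameters.

```python
from typing import List, Optional

FILE_TYPES = {"cif", "pdb", "pae", "conf", "ALL"}


def build_add_command(
    archive_out: str,
    indir: str,
    file_type: str = "ALL",
    archive_in: Optional[str] = None,
    minimal: bool = False,
    max_error_bb: Optional[int] = None,
    max_error_sc: Optional[int] = None,
    pae_lossy_level: Optional[int] = None,
    threads: Optional[int] = None,
    binary: str = "./bin/protestar",  # assumed location of the built binary
) -> List[str]:
    """Build an argv list for `protestar add` (hypothetical helper)."""
    if file_type not in FILE_TYPES:
        raise ValueError(f"unknown file type: {file_type}")
    # Documented range for both coordinate error limits is [0, 500].
    for name, value in (("max_error_bb", max_error_bb),
                        ("max_error_sc", max_error_sc)):
        if value is not None and not 0 <= value <= 500:
            raise ValueError(f"{name} must be in [0, 500], got {value}")
    # Documented range for the PAE lossy level is [0, 4].
    if pae_lossy_level is not None and not 0 <= pae_lossy_level <= 4:
        raise ValueError(f"pae_lossy_level must be in [0, 4], got {pae_lossy_level}")

    cmd = [binary, "add", "--type", file_type, "--indir", indir]
    if archive_in is not None:
        cmd += ["--in", archive_in]  # extend an existing archive
    cmd += ["--out", archive_out]
    if minimal:
        cmd.append("--minimal")
    lossy_requested = any(
        v is not None for v in (max_error_bb, max_error_sc, pae_lossy_level)
    )
    if lossy_requested:
        cmd.append("--lossy")
    if max_error_bb is not None:
        cmd += ["--max-error-bb", str(max_error_bb)]
    if max_error_sc is not None:
        cmd += ["--max-error-sc", str(max_error_sc)]
    if pae_lossy_level is not None:
        cmd += ["--pae-lossy-level", str(pae_lossy_level)]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


example = build_add_command(
    "test_pdb_10_100.psa", "apsd-data/pdb/", file_type="pdb",
    minimal=True, max_error_bb=10, max_error_sc=100,
)
print(" ".join(example))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the binary has been built; the sketch itself only prints the command.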
protestar get <options>
Options:
- --type <string> – file type: cif, pdb, pae, conf, ALL
- --in <string> – name of input archive
- --outdir <string> – output directory
- --file <string> – file name to get
- --list <string> – name of file with file names to get
- --all – get all files
- -t|--threads <int> – no. of threads
- -v|--verbose <int> – verbosity level
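To extract a chosen subset of files in one call, the names can be written to a text file and passed with `--list`. The sketch below assumes the list file holds one file name per line, which the option description does not state explicitly; `write_name_list` and `build_get_command` are hypothetical helpers, and the second name in the demo is a placeholder rather than a real entry.

```python
import os
import tempfile
from typing import Iterable, List


def write_name_list(names: Iterable[str], path: str) -> None:
    """Write one file name per line (assumed format of the --list file)."""
    with open(path, "w", encoding="utf-8") as handle:
        for name in names:
            handle.write(name + "\n")


def build_get_command(
    archive: str,
    outdir: str,
    list_path: str,
    file_type: str = "ALL",
    binary: str = "./bin/protestar",  # assumed location of the built binary
) -> List[str]:
    """Build an argv list for `protestar get --list` (hypothetical helper)."""
    return [
        binary, "get",
        "--in", archive,
        "--outdir", outdir,
        "--type", file_type,
        "--list", list_path,
    ]


wanted = [
    "AF-A0A075B6I1-F1-model_v4",  # name used in the usage examples
    "SECOND-STRUCTURE-NAME",      # placeholder, not a real archive entry
]
list_path = os.path.join(tempfile.mkdtemp(), "wanted.txt")
write_name_list(wanted, list_path)
get_cmd = build_get_command("test_mixed.psa", "./", list_path)
print(" ".join(get_cmd))
```

As with the previous sketch, the command is only printed; passing `get_cmd` to `subprocess.run` would execute it.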
protestar list <options>
Options:
- --in <string> – name of input archive
- --type <string> – file type: cif, pdb, pae, conf, ALL
- --show-file-info – show some information about file types
protestar info <options>
Options:
- --in <string> – name of input archive
ProteStAr files can also be accessed with a C++ library. Python library will be available soon.
The C++ API is provided in the src/lib-cxx/protestar-api.h file. You can also take a look at src/example_api to see the API in use.
ProteStAr archives can be accessed through the pyprotestar Python package. The package has to be compiled separately:
make -j pyprotestar
As a result, a library named like pyprotestar.cpython-38-x86_64-linux-gnu.so will be created in the pyprotestar directory. To make the package visible in Python, go to this directory and extend the PYTHONPATH environment variable with the following commands:
cd pyprotestar
source set_path.sh
After that, the pyprotestar package can be imported in a Python script:
import pyprotestar
In the current directory one can find an example script named pyprotestar_test.py, which compresses a set of CIF, PDB, PAE, and CONF files using a regular ProteStAr binary (it assumes the binary was previously built with the make -j command and is available in the ../bin/ subdirectory) and then accesses the resulting archive using pyprotestar. A single file of each type is extracted from the archive and given as input to the appropriate parser. In particular, the Bio.PDB package from Biopython is used for CIF/PDB files, while PAE and CONF files are parsed using the regular JSON library. Therefore, Biopython has to be installed prior to running the script:
pip install biopython
python3 pyprotestar_test.py
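Once a PAE file has been extracted from an archive, the standard `json` module is enough to inspect it. The sketch below does not call pyprotestar; it works on an inline sample and assumes the layout used by recent AlphaFold Protein Structure Database PAE files, namely a one-element list whose object holds a square `predicted_aligned_error` matrix and a `max_predicted_aligned_error` value. The function name `summarize_pae` and the sample values are made up for illustration.

```python
import json

# Tiny made-up sample in the assumed AlphaFold DB PAE layout.
SAMPLE_PAE_JSON = """
[{"predicted_aligned_error": [[0, 3, 7], [2, 0, 5], [8, 4, 0]],
  "max_predicted_aligned_error": 31.75}]
"""


def summarize_pae(text: str) -> dict:
    """Return basic statistics for a PAE JSON document (assumed layout)."""
    record = json.loads(text)[0]
    matrix = record["predicted_aligned_error"]
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("PAE matrix is not square")
    # Skip the diagonal, where a residue is compared with itself.
    off_diagonal = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diagonal) / len(off_diagonal) if off_diagonal else 0.0
    return {
        "residues": n,
        "mean_off_diagonal": mean,
        "max_reported": record["max_predicted_aligned_error"],
    }


summary = summarize_pae(SAMPLE_PAE_JSON)
print(summary)
```

For a real file, the text read from the extracted JSON file would take the place of `SAMPLE_PAE_JSON`; files written in a different layout would need a different accessor.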
The data in apsd-data were selected from the AlphaFold Protein Structure Database to allow quick experiments with the tool.
- The full datasets used in the experiments were taken from the AlphaFold Protein Structure Database and ESM Atlas.
- The subset of ESM Atlas used in the experiments can be downloaded from ESM subset (1.6 GB file).
- After decompression of CIF files, the formatting of tables may differ slightly from the original. The content is, however, the same.
Deorowicz, S., Gudyś, A. (2023) Efficient protein structure archiving using ProteStAr, bioRxiv preprint, https://doi.org/10.1101/2023.07.20.549913.