Provide first party bulk download #601

Open
cthoyt opened this issue Oct 1, 2024 · 5 comments

Comments

@cthoyt

cthoyt commented Oct 1, 2024

I would like to download the entire database all at once. Having a paginated API isn't ideal for this. It would be great if you had a bulk download link.

Perhaps this could be periodically dumped to Zenodo.

@cthoyt
Author

cthoyt commented Oct 1, 2024

In the meantime, I wrote a library that extracts everything from the API (https://github.com/cthoyt/biotools-client; a full run takes about 2 hours) and uploaded the results to Zenodo under the CC BY 4.0 license (https://zenodo.org/records/13869530).
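For readers who want to see the general shape of such an extraction, here is a minimal sketch of paging through an API and writing everything to one compressed file. It is not the code from biotools-client. The `fetch_page` callable, the `list` and `next` response fields, and the JSON Lines output format are all assumptions made for illustration; check them against the real API before relying on this.

```python
import gzip
import json
from typing import Callable, Iterable, Iterator, Optional


def iter_records(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every record from a paginated API, one page at a time.

    `fetch_page` is a hypothetical callable that takes a page number and
    returns the decoded JSON payload for that page.
    """
    page: Optional[int] = 1
    while page is not None:
        payload = fetch_page(page)
        # Assumption: the records of a page live under the "list" key.
        yield from payload.get("list", [])
        # Assumption: a missing or empty "next" field marks the last page.
        page = page + 1 if payload.get("next") else None


def dump_records(records: Iterable[dict], path: str) -> int:
    """Write records as gzip-compressed JSON Lines and return how many were written."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as file:
        for record in records:
            file.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Passing the page fetcher in as a function keeps the paging logic independent of any particular HTTP client, so it can be tried against canned pages before pointing it at a live server.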

@redmitry
Contributor

redmitry commented Oct 1, 2024

I did a similar task for tools monitoring:
https://gitlab.bsc.es/inb/elixir/openebench/tools-monitoring/-/blob/main/tools-monitoring-import/biotools-import.py
It converts bio.tools entries to Bioschemas, so you would need to remove the conversion part and save to disk instead of pushing to the server.
It may need some adaptation :-(

Cheers,

D.

P.S. Importing all of bio.tools should take about 10 minutes; 2 hours sounds like a lot.

@veitveit
Member

veitveit commented Oct 2, 2024

This is a very good point and we are well aware of it.

The future synchronization with the ecosystem might partly solve this.

I assume a zipped JSON file would be preferred?

We are backing up the content as SQL files, which could be an alternative (after removal of user information).

@cthoyt
Author

cthoyt commented Oct 2, 2024

Yes, a zipped or otherwise compressed JSON would get the job done! Make sure this archive is on an external system outside of ELIXIR infrastructure, so it will still exist after ELIXIR ends (future planning ;))

@redmitry
Contributor

redmitry commented Oct 4, 2024

Hi All,
I made the simplest bio.tools importer I could. On my laptop it takes 15 minutes to load all the data.
I also noticed that bio.tools doesn't compress the data sent via the API.
You could save some time and a lot of traffic by enabling gzip on the server.

Best,

Dmitry
biotools-backup.zip

P.S. I put it under the Apache 2.0 license, because using any piece of code without a license may be troublesome.
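To make the gzip point concrete, here is a small sketch of the client side of HTTP compression plus a quick way to estimate the savings on JSON payloads. It is not taken from the attached importer. The helper names are hypothetical, and the sketch only assumes standard HTTP behaviour: the client sends `Accept-Encoding: gzip`, and the server, if configured to compress, answers with `Content-Encoding: gzip`.

```python
import gzip
import json
import urllib.request
from typing import List, Optional


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request that tells the server gzip responses are acceptable."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Decompress the body only if the server reports that it gzip-encoded it."""
    if content_encoding and content_encoding.lower() == "gzip":
        return gzip.decompress(body)
    return body


def compression_ratio(records: List[dict]) -> float:
    """Return compressed size divided by raw size for a JSON-encoded payload."""
    raw = json.dumps(records).encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)
```

A client would pass the response's `Content-Encoding` header value to `decode_body`, so the same code works whether or not the server compresses. Running `compression_ratio` on a sample page of records gives a rough idea of how much traffic server-side gzip would save.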

3 participants