How do you backfill bags from an existing storage provider to a new provider? #1007

Open

alexwlchan opened this issue Jun 10, 2022

Suppose you add a new storage provider (see #1006). This means all new bags will be replicated to the new provider, but how do you backfill all the existing bags?

Assumptions

  • Every existing bag will be replicated to every storage provider. It would be additional work to support mixed locations.

Prior art

When we first built the storage service, we only replicated bags to Amazon S3.

We added support for Azure Blob later; backfilling the existing bags was a somewhat manual and hacked-together process that wouldn't be easily repeatable. We should try to find a more robust approach.

High-level proposal

Within the storage service, an ingest is a record of some processing on a bag. Currently we support two ingest types:

  • create a new copy of a bag
  • update an existing copy of a bag

Replicating an existing bag to a new location could be another type of ingest.
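
For illustration only, the ingest types might be modelled something like this; the `replicate-existing-bag` identifier is a made-up name for the proposed new type, not anything the storage service defines today:

```python
from enum import Enum

class IngestType(Enum):
    # The two ingest types the storage service supports today
    CREATE = "create"  # create a new copy of a bag
    UPDATE = "update"  # update an existing copy of a bag

    # Hypothetical new type for this proposal: replicate an
    # already-stored bag to a new storage provider
    REPLICATE_EXISTING_BAG = "replicate-existing-bag"
```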

You'd start the job using the ingests API (exact design tbd), track it through the ingests API and ingests tracker, and the new replica location would be added to the storage manifest when the job completed. It'd look something like this:

```mermaid
graph LR
    A[... pipeline for<br/>new bags] --> EV[verifiers for<br/>new bags]
    EV --> RA[Replica aggregator]

    IA[ingests API] --> BRe[backfill replicator]
    BRe --> BV[backfill verifier]
    BV --> RA
    RA --> BR[Bag register]
```
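
As a purely hypothetical sketch of what starting such a job might look like: the endpoint, field names, and ingest type identifier below are all assumptions, since the exact design is tbd.

```python
import requests

# Hypothetical request to the ingests API; every field name and the URL
# are placeholders, not the storage service's actual API.
payload = {
    "type": "Ingest",
    "ingestType": {"id": "replicate-existing-bag", "type": "IngestType"},
    "space": {"id": "digitised", "type": "Space"},
    "bag": {
        "info": {"externalIdentifier": "b12345678", "type": "BagInfo"},
        "type": "Bag",
    },
}

resp = requests.post("https://storage.example.org/ingests", json=payload)
resp.raise_for_status()

# The Location header would point at the new ingest, which you'd then
# poll to track the backfill's progress.
print("Created ingest:", resp.headers.get("Location"))
```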

Considerations

  • The storage service reporting would let you work out which bags haven't been backfilled into the new location.

  • You'd likely start from the "warm" replica location, which is S3, but objects in this location aren't always available for retrieval, e.g. sometimes objects get cycled to Glacier (see Large Things Living in Cold Places). Would we need a "bag warmer" step that retrieves any objects from Glacier before kicking off the replication step?
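
One rough sketch of how a "bag warmer" might work against S3, using boto3. It assumes a bag's objects all live under one prefix, and it only requests restores; polling for completion, the restore tier, and error handling are left open.

```python
import boto3

s3 = boto3.client("s3")

def warm_bag(bucket: str, prefix: str, days: int = 7) -> None:
    """Request Glacier restores for any objects in a bag that aren't
    immediately retrievable.  A rough sketch, not production code."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj.get("StorageClass", "STANDARD") not in ("GLACIER", "DEEP_ARCHIVE"):
                continue  # already retrievable

            head = s3.head_object(Bucket=bucket, Key=obj["Key"])
            if head.get("Restore"):
                continue  # a restore is already in progress or complete

            s3.restore_object(
                Bucket=bucket,
                Key=obj["Key"],
                RestoreRequest={
                    "Days": days,
                    "GlacierJobParameters": {"Tier": "Bulk"},
                },
            )
```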
