Restore improvement: index #4044

Michal-Leszczynski · 2024-09-25T11:54:39Z

This PR is the core of restore improvements prepared for SM 3.4.
It refactors the general workflow from restoring location by location, manifest by manifest, table by table, to a version where nothing waits on nothing! All of the remote sstable dirs are indexed at the beginning, fed to batch dispatcher, which controls batch creation. For now, the batch dispatcher creates batches in the same way as before, but it will be imporved as a part of #3979.

Rebased on #4040.

Fixes #3981

Right now SM restores location by location, manifest by manifest, table by table. That's why it tracks restore progress by keeping location/manifest/table in the DB. We are moving away from sequential restore approach in favor of restoring from all locations/manifests/tables at the same time.

…dFiles There is no need to iterate versioned files (ListVersionedFiles) and not versioned files (buildFilesSizesCache) separately. Doing it in a single iteration is faster, and it allows to store all size information in a single place.

Michal-Leszczynski · 2024-09-27T08:16:10Z

@karol-kokoszka This PR is ready for review!
It already passed CI once, but I decided to split one commit, so the CI is running again. Let's hope this refactor won't break anything.

mykaul · 2024-09-30T14:01:33Z

This PR is the core of restore improvements prepared for SM 3.4. It refactors the general workflow from restoring location by location, manifest by manifest, table by table, to a version where nothing waits on nothing! All of the remote sstable dirs are indexed at the beginning, fed to batch dispatcher, which controls batch creation. For now, the batch dispatcher creates batches in the same way as before, but it will be imporved as a part of #3979.

Rebased on #4040.

Fixes #3981

General note - I'm lacking some documentation here. For example, how do we sort them (if we do) after indexing? do we have any way to control it, etc.

karol-kokoszka · 2024-09-30T15:11:17Z

We have just the document (design doc?) describing the restore in Scylla Manager 3.1.
https://docs.google.com/document/d/1ugf6XCs28c4pwKsw9Lbu_EFb6ZSgMpJ_-Usf-Q9XAP4/edit#heading=h.f7jv2tbf2ofd

I agree that all the details that are going to be included into 3.4 must be documented.

We can either make it public, by editing README.md of https://github.com/scylladb/scylla-manager/tree/master/pkg/service/restore and putting reference to it in main README, or by just preparing new design doc documenting all the changes/updates or current state.

The design, list of changes were introduced on our meeting where we wanted to understand the bottlenecks from the current implementation. We haven't created formal design doc.... but just set of issues.

Issue for the documentation #4051

karol-kokoszka

👍

pkg/service/restore/index.go

Michal-Leszczynski · 2024-10-02T18:46:50Z

@karol-kokoszka I addressed all the comments, please take a look.

Michal-Leszczynski · 2024-10-03T07:45:47Z

Failures are known issues/flakes:

karol-kokoszka · 2024-10-03T08:27:41Z

Please check why Sanity check fails.

Michal-Leszczynski · 2024-10-03T08:36:34Z

Failures are known issues/flakes:

Investigate lates-enterprise failures regarding user creation in restore schema tests #4050

TestRetryRest/TestRetryCancelContext/Repeat is flaky #3848

Please check why Sanity check fails.

Because of the known flake: #3848

This commit introduces the structure of restore workload. Workload is divided per location->table->remote sstable directory. This changes the hierarchy established by manifests (location->node->table->remote sstable dir). It also aggregates files into actual sstables, extracts their IDs, and aggregates their sizes, and keeps track of sstable versioning.

Indexed workload won't contain sstables that were already restored during previous restore run.

This is a temporary implementation used for integrating workload indexing with the rest of the code. It will be improved as a part of the #3979.

This commit makes use of the new indexing and batching approaches and uses them in the restore tables codebase.

Recent commits changed versioned batch download so that if any sstable component is versioned, then all sstable components are downloaded as versioned files. It was done in that way to allow easier versioned progress calculation (we don't store per file size, only the whole sstable size). This brought to light a bug (that existed before, but was more difficult to hit), in which restoring batch failed when the whole batch was versioned, as calling RcloneSyncCopyPaths on empty paths parameter resulted in broken download. We could just skip the RcloneSyncCopyPaths call when the whole batch is versioned, but this would leave us without the agentJobID which is a part of sort key in RestoreRunProgress. Without it, we could potentially overwrite one restore run progress with another - if both of them happened on the same RemoteSSTableDir, by the same Host, and were fully versioned. It would also introduce a different path for restoring regular batch and fully versioned batch, which is not desirable. That's why I decided to modify rclone server to allow empty path parameter, so that it still generates agentJobID, but it doesn't do anything except for that.

Workload info contains location/table/remote sstable dir sstable count, total size, max and average sstable size.

Michal-Leszczynski force-pushed the ml/restore-index branch 3 times, most recently from db5c9ec to 1f15c79 Compare September 26, 2024 12:41

Michal-Leszczynski added 2 commits September 26, 2024 16:33

feat(restore): adjust model to dropped sequential restore run tracking

899edfe

Michal-Leszczynski force-pushed the ml/restore-index branch from 1f15c79 to bc04376 Compare September 26, 2024 14:38

Michal-Leszczynski force-pushed the ml/restore-index branch from bc04376 to 1762b02 Compare September 27, 2024 08:14

Michal-Leszczynski marked this pull request as ready for review September 27, 2024 08:14

Michal-Leszczynski requested a review from karol-kokoszka as a code owner September 27, 2024 08:14

Michal-Leszczynski mentioned this pull request Oct 2, 2024

Restore improvement: transfers #4054

Merged

karol-kokoszka approved these changes Oct 2, 2024

View reviewed changes

Michal-Leszczynski force-pushed the ml/restore-index branch from 97f9642 to 0efed73 Compare October 2, 2024 19:30

karol-kokoszka approved these changes Oct 3, 2024

View reviewed changes

Michal-Leszczynski added 7 commits October 3, 2024 10:53

feat(restore): index, support resume

f6b5c0a

Indexed workload won't contain sstables that were already restored during previous restore run.

feat(restore): index, support metrics init

737cccb

feat(restore): add primitive batching using indexed workload

72f9c6f

This is a temporary implementation used for integrating workload indexing with the rest of the code. It will be improved as a part of the #3979.

feat(restore): integrate new indexing and batching with codebase

c0edb79

This commit makes use of the new indexing and batching approaches and uses them in the restore tables codebase.

feat(restore): index, log workload info

503bd5c

Workload info contains location/table/remote sstable dir sstable count, total size, max and average sstable size.

Michal-Leszczynski force-pushed the ml/restore-index branch from 0efed73 to 503bd5c Compare October 3, 2024 08:53

Michal-Leszczynski merged commit 14aef7b into master Oct 3, 2024
49 of 51 checks passed

Michal-Leszczynski deleted the ml/restore-index branch October 3, 2024 09:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Restore improvement: index #4044

Restore improvement: index #4044

Michal-Leszczynski commented Sep 25, 2024 •

edited

Loading

Michal-Leszczynski commented Sep 27, 2024

mykaul commented Sep 30, 2024

karol-kokoszka commented Sep 30, 2024

karol-kokoszka left a comment

Michal-Leszczynski commented Oct 2, 2024

Michal-Leszczynski commented Oct 3, 2024

karol-kokoszka commented Oct 3, 2024

Michal-Leszczynski commented Oct 3, 2024

Restore improvement: index #4044

Restore improvement: index #4044

Conversation

Michal-Leszczynski commented Sep 25, 2024 • edited Loading

Michal-Leszczynski commented Sep 27, 2024

mykaul commented Sep 30, 2024

karol-kokoszka commented Sep 30, 2024

karol-kokoszka left a comment

Choose a reason for hiding this comment

Michal-Leszczynski commented Oct 2, 2024

Michal-Leszczynski commented Oct 3, 2024

karol-kokoszka commented Oct 3, 2024

Michal-Leszczynski commented Oct 3, 2024

Michal-Leszczynski commented Sep 25, 2024 •

edited

Loading