Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3Repo now uses Async client and TransferManager #613

Merged
merged 3 commits into from
Apr 26, 2024

Conversation

chelma
Copy link
Member

@chelma chelma commented Apr 25, 2024

Description

  • Updated the S3Repo to use the S3AsyncClient instead of the synchronous client
  • Updated S3Repo to download the shard blob files more efficiently using the S3 TransferManager

Issues Resolved

Testing

  • Updated unit tests
  • Ran a test migration locally, you can see the new behavior below:
14:45:31.384 INFO  Unpacking blob files to disk...
14:45:31.384 INFO  Processing index: logs-241998
14:45:31.386 INFO  === Shard ID: 0 ===
14:45:31.396 INFO  Downloading file from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/snap-a49xLBn1TqyxcjyKiWgtpA.dat to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/snap-a49xLBn1TqyxcjyKiWgtpA.dat
14:45:31.865 INFO  Downloading blob files from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/ to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0
14:45:32.435 INFO  Blob file download(s) complete
14:45:32.449 INFO  Unpacking - Blob Name: __eUUGlyLIT9-NLqru0m367Q, Lucene Name: _2_Lucene84_0.doc
14:45:32.455 INFO  Unpacking - Blob Name: __iydNTkHZRJqvfZwbdlaJYA, Lucene Name: _2_Lucene84_0.tim
14:45:32.456 INFO  Unpacking - Blob Name: __PsRSXgBYRSWxKELAkEbcpw, Lucene Name: _2.kdd
14:45:32.456 INFO  Unpacking - Blob Name: v__aZ1RTOxzSX-PdX-dRbBz8A, Lucene Name: _2.si
14:45:32.457 INFO  Unpacking - Blob Name: __LVamudogTvGCpgJnb3BxNA, Lucene Name: _2.kdi
14:45:32.457 INFO  Unpacking - Blob Name: __gzYq-jl6TSSCD7w0S17AJw, Lucene Name: _2.fdm
14:45:32.458 INFO  Unpacking - Blob Name: __D7j2BRqiRKOe83Ht0869PA, Lucene Name: _2_Lucene84_0.pos
14:45:32.459 INFO  Unpacking - Blob Name: __IaW8fFBTS7aoQ7-EdYSd4g, Lucene Name: _2.fdt
14:45:32.460 INFO  Unpacking - Blob Name: ___1JaSLAJTHCk-TdYG_6JYg, Lucene Name: _2_Lucene80_0.dvm
14:45:32.461 INFO  Unpacking - Blob Name: __Xpi9vJHESKSPqxGvxDa1RA, Lucene Name: _2.kdm
14:45:32.461 INFO  Unpacking - Blob Name: __o6l-K0YpQ2K695c8tiL7JA, Lucene Name: _2.fdx
14:45:32.462 INFO  Unpacking - Blob Name: v__3VZW6cZNSwqlVaTn0SsIFA, Lucene Name: segments_4
14:45:32.462 INFO  Unpacking - Blob Name: ___8lqMrY7QWiWMoUFMADo1g, Lucene Name: _2_Lucene84_0.tip
14:45:32.463 INFO  Unpacking - Blob Name: __HOR0nDIEQhOu3a-dwtRvKg, Lucene Name: _2.nvd
14:45:32.463 INFO  Unpacking - Blob Name: __tgnM14uTRf-B18lFuxBxPQ, Lucene Name: _2_Lucene80_0.dvd
14:45:32.464 INFO  Unpacking - Blob Name: __gMHUg1-3Ql26o8E3wtH1HQ, Lucene Name: _2.nvm
14:45:32.465 INFO  Unpacking - Blob Name: __3_oFHwvnSZaKCh1pNMbtjA, Lucene Name: _2.fnm
14:45:32.466 INFO  Unpacking - Blob Name: __0Uj5U7guRlO1FVqwar5j4g, Lucene Name: _2_Lucene84_0.tmd

Compare that to:

11:27:45.963 INFO  Unpacking blob files to disk...
11:27:45.963 INFO  Processing index: logs-241998
11:27:45.965 INFO  === Shard ID: 0 ===
11:27:45.976 INFO  Downloading file from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/snap-a49xLBn1TqyxcjyKiWgtpA.dat to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/snap-a49xLBn1TqyxcjyKiWgtpA.dat
11:27:46.710 INFO  Unpacking - Blob Name: __eUUGlyLIT9-NLqru0m367Q, Lucene Name: _2_Lucene84_0.doc
11:27:46.714 INFO  Downloading file from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__eUUGlyLIT9-NLqru0m367Q to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__eUUGlyLIT9-NLqru0m367Q
11:27:46.976 INFO  Unpacking - Blob Name: __iydNTkHZRJqvfZwbdlaJYA, Lucene Name: _2_Lucene84_0.tim
11:27:46.976 INFO  Downloading file from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__iydNTkHZRJqvfZwbdlaJYA to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__iydNTkHZRJqvfZwbdlaJYA
11:27:47.154 INFO  Unpacking - Blob Name: __PsRSXgBYRSWxKELAkEbcpw, Lucene Name: _2.kdd
11:27:47.155 INFO  Downloading file from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__PsRSXgBYRSWxKELAkEbcpw to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__PsRSXgBYRSWxKELAkEbcpw
11:27:47.347 INFO  Unpacking - Blob Name: v__aZ1RTOxzSX-PdX-dRbBz8A, Lucene Name: _2.si
11:27:47.348 INFO  Unpacking - Blob Name: __LVamudogTvGCpgJnb3BxNA, Lucene Name: _2.kdi
11:27:47.348 INFO  Downloading file from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__LVamudogTvGCpgJnb3BxNA to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__LVamudogTvGCpgJnb3BxNA
11:27:47.496 INFO  Unpacking - Blob Name: __gzYq-jl6TSSCD7w0S17AJw, Lucene Name: _2.fdm
11:27:47.496 INFO  Downloading file from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__gzYq-jl6TSSCD7w0S17AJw to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__gzYq-jl6TSSCD7w0S17AJw
11:27:47.644 INFO  Unpacking - Blob Name: __D7j2BRqiRKOe83Ht0869PA, Lucene Name: _2_Lucene84_0.pos
11:27:47.645 INFO  Downloading file from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__D7j2BRqiRKOe83Ht0869PA to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__D7j2BRqiRKOe83Ht0869PA
11:27:47.828 INFO  Unpacking - Blob Name: __IaW8fFBTS7aoQ7-EdYSd4g, Lucene Name: _2.fdt
11:27:47.829 INFO  Downloading file from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__IaW8fFBTS7aoQ7-EdYSd4g to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__IaW8fFBTS7aoQ7-EdYSd4g
11:27:47.981 INFO  Unpacking - Blob Name: ___1JaSLAJTHCk-TdYG_6JYg, Lucene Name: _2_Lucene80_0.dvm
11:27:47.982 INFO  Downloading file from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/___1JaSLAJTHCk-TdYG_6JYg to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/___1JaSLAJTHCk-TdYG_6JYg
11:27:48.147 INFO  Unpacking - Blob Name: __Xpi9vJHESKSPqxGvxDa1RA, Lucene Name: _2.kdm
11:27:48.148 INFO  Downloading file from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__Xpi9vJHESKSPqxGvxDa1RA to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__Xpi9vJHESKSPqxGvxDa1RA
11:27:48.375 INFO  Unpacking - Blob Name: __o6l-K0YpQ2K695c8tiL7JA, Lucene Name: _2.fdx
11:27:48.376 INFO  Downloading file from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__o6l-K0YpQ2K695c8tiL7JA to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__o6l-K0YpQ2K695c8tiL7JA
11:27:48.525 INFO  Unpacking - Blob Name: v__3VZW6cZNSwqlVaTn0SsIFA, Lucene Name: segments_4
11:27:48.526 INFO  Unpacking - Blob Name: ___8lqMrY7QWiWMoUFMADo1g, Lucene Name: _2_Lucene84_0.tip
11:27:48.526 INFO  Downloading file from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/___8lqMrY7QWiWMoUFMADo1g to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/___8lqMrY7QWiWMoUFMADo1g
11:27:48.674 INFO  Unpacking - Blob Name: __HOR0nDIEQhOu3a-dwtRvKg, Lucene Name: _2.nvd
11:27:48.675 INFO  Downloading file from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__HOR0nDIEQhOu3a-dwtRvKg to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__HOR0nDIEQhOu3a-dwtRvKg
11:27:48.837 INFO  Unpacking - Blob Name: __tgnM14uTRf-B18lFuxBxPQ, Lucene Name: _2_Lucene80_0.dvd
11:27:48.838 INFO  Downloading file from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__tgnM14uTRf-B18lFuxBxPQ to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__tgnM14uTRf-B18lFuxBxPQ
11:27:49.035 INFO  Unpacking - Blob Name: __gMHUg1-3Ql26o8E3wtH1HQ, Lucene Name: _2.nvm
11:27:49.036 INFO  Downloading file from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__gMHUg1-3Ql26o8E3wtH1HQ to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__gMHUg1-3Ql26o8E3wtH1HQ
11:27:49.340 INFO  Unpacking - Blob Name: __3_oFHwvnSZaKCh1pNMbtjA, Lucene Name: _2.fnm
11:27:49.341 INFO  Downloading file from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__3_oFHwvnSZaKCh1pNMbtjA to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__3_oFHwvnSZaKCh1pNMbtjA
11:27:49.524 INFO  Unpacking - Blob Name: __0Uj5U7guRlO1FVqwar5j4g, Lucene Name: _2_Lucene84_0.tmd
11:27:49.524 INFO  Downloading file from S3: s3://chelma-iad-rfs-local-testing/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__0Uj5U7guRlO1FVqwar5j4g to /tmp/s3_files/indices/gCmLQ2WOTOyaKAUF2xSdcg/0/__0Uj5U7guRlO1FVqwar5j4g

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Chris Helma <chelma+github@amazon.com>
Signed-off-by: Chris Helma <chelma+github@amazon.com>
@chelma chelma changed the title Migrations 1688 S3Repo now uses Async client and TransferManager Apr 25, 2024
@@ -77,4 +77,8 @@ public Path getBlobFilePath(String indexId, int shardId, String blobName) throws
Path filePath = shardDirPath.resolve(blobName);
return filePath;
}

public void prepBlobFiles(ShardMetadata.Data shardMetadata) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This goes for other implementing methods here too, but curious why you aren't including @Override. I've found it useful especially when methods have default implementations

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh - completely missed this. Yeah, probably should have done that but I'm a bit baffled why it worked without it? Seems like the compiler should have enforced this... Java noob problems, I guess.

Will add, and will learn what that annotation actually does 😛

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

public static S3Repo create(Path s3LocalDir, S3Uri s3Uri, String s3Region) {
S3AsyncClient s3Client = S3AsyncClient.crtBuilder()
.credentialsProvider(DefaultCredentialsProvider.create())
.region(Region.of(s3Region))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thoughts on specifying a retry configuration

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - will do!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


import java.util.Comparator;
import java.util.Optional;

public class S3Repo implements SourceRepo {
private static final Logger logger = LogManager.getLogger(S3Repo.class);
private static final double S3_TARGET_THROUGHPUT_GIBPS = 10.0; // Arbitrarily chosen
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appears by default, target throughput contributes to a memory usage that could exhaust resources. We should also specify the relatively new attribute maxNativeMemoryLimitInBytes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea - will investigate!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

public static S3Repo create(Path s3LocalDir, S3Uri s3Uri, String s3Region) {
S3AsyncClient s3Client = S3AsyncClient.crtBuilder()
.credentialsProvider(DefaultCredentialsProvider.create())
.region(Region.of(s3Region))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add crossRegionAccessEnabled

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question; I think the answer here is "no" until there's a clear need for it. It will help enforce regionalization which, in general, is a good thing.

Signed-off-by: Chris Helma <chelma+github@amazon.com>

import java.util.Comparator;
import java.util.Optional;

public class S3Repo implements SourceRepo {
private static final Logger logger = LogManager.getLogger(S3Repo.class);
private static final double S3_TARGET_THROUGHPUT_GIBPS = 8.0; // Arbitrarily chosen
private static final long S3_MAX_MEMORY_BYTES = 1024L * 1024 * 1024; // Arbitrarily chosen
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: a bit hard to understand what this is (1.073 gb), i'd prefer something like

private static final long S3_MAX_MEMORY_BYTES = 1_000_000_000L; // Minimum value (1 GB)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - I'm pretty sure system memory uses GiB, not GB, right? In other words - it's base 2 not base 10. A quick internet search seems to confirm.

@chelma chelma merged commit 577ffc6 into opensearch-project:main Apr 26, 2024
4 of 5 checks passed
@chelma chelma deleted the MIGRATIONS-1688 branch April 26, 2024 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants