
feat: Allow custom configuration of S3Client #102

Merged · 10 commits · Apr 16, 2024
Conversation

@morgsmccauley morgsmccauley commented Apr 14, 2024

Previously, the S3Client trait was used to mock requests made to S3 so that we could unit test our S3-related logic. This PR allows custom implementations of this trait to be configured and used within Lake Framework. This grants users greater control over the S3 requests made, which would otherwise not be possible via configuration of s3_config alone. The main motivation for this change is to add caching of requests, but the solution is generic enough that it could be extended to many other use cases.

S3Client trait updates

S3Client has been updated to make it easier to work with from external crates. Previously, it was a thin wrapper over GetObject and ListObjects, but it has been abstracted further so the user does not have to deal with the complex return types of these operations.

On success, the S3Client methods now return primitive types, as opposed to high-level response wrappers such as GetObjectOutput. This makes both caching and mocking much easier.
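As an illustration, here is a simplified, synchronous sketch of such a trait. The real trait is async, and the method and type names below are assumptions for illustration, not the crate's actual API. Returning Vec<u8> and Vec<String> instead of GetObjectOutput is what makes an in-memory mock trivial:

```rust
use std::collections::HashMap;
use std::error::Error;
use std::sync::Arc;

// Hypothetical simplified (synchronous) sketch of the reworked trait:
// methods return primitive types (bytes, key lists) instead of
// GetObjectOutput/ListObjectsV2Output, so mocks and caches stay simple.
trait S3Client {
    fn get_object_bytes(&self, key: &str) -> Result<Vec<u8>, Arc<dyn Error>>;
    fn list_keys(&self, start_after: &str) -> Result<Vec<String>, Arc<dyn Error>>;
}

// An in-memory mock shows why primitives help: no AWS types needed at all.
struct MockS3 {
    objects: HashMap<String, Vec<u8>>,
}

impl S3Client for MockS3 {
    fn get_object_bytes(&self, key: &str) -> Result<Vec<u8>, Arc<dyn Error>> {
        self.objects
            .get(key)
            .cloned()
            .ok_or_else(|| Arc::from(Box::from("not found") as Box<dyn Error>))
    }

    fn list_keys(&self, start_after: &str) -> Result<Vec<String>, Arc<dyn Error>> {
        // Return keys lexicographically after `start_after`, sorted.
        let mut keys: Vec<String> = self
            .objects
            .keys()
            .filter(|k| k.as_str() > start_after)
            .cloned()
            .collect();
        keys.sort();
        Ok(keys)
    }
}
```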

Type Erasure

On failure, we return dyn std::error::Error. This allows any error to be returned, rather than only the complex SdkError<GetObjectError>. Without this, every error returned from the method would have to be reconstructed as an S3 error, which doesn't make sense if the error itself is related to, for example, Mutex locking.

This comes with the downside of errors being "opaque", so responding to specific errors is more difficult. If concrete error types are needed, this opaque error will need to be downcast, as you'll see in s3_fetchers.rs.

Each method returns a "newtype" wrapper around dyn std::error::Error to provide some separation between the two methods, and make them a little easier to work with.
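A minimal sketch of that newtype pattern; the names GetObjectBytesError and is_parse_error are hypothetical, not the crate's real identifiers:

```rust
use std::error::Error;
use std::fmt;
use std::sync::Arc;

// Hypothetical newtype in the spirit of the PR: each S3Client method gets
// its own error type wrapping an Arc'd trait object.
#[derive(Clone, Debug)]
struct GetObjectBytesError(Arc<dyn Error + Send + Sync>);

impl fmt::Display for GetObjectBytesError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "get object bytes error: {}", self.0)
    }
}

// Downcasting recovers a concrete type when a caller must react to a
// specific failure, as described above.
fn is_parse_error(err: &GetObjectBytesError) -> bool {
    err.0.downcast_ref::<std::num::ParseIntError>().is_some()
}
```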

Arc<Error> instead of Box<Error>

While Box is cheaper and easier to work with, a boxed error cannot be cloned or shared, which makes it harder to work with in an async context. Therefore, I've opted to use Arc instead, even though it comes with a slight runtime penalty.
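A minimal sketch of the difference, assuming a Send + Sync bound on the error: an Arc'd error can be cloned cheaply and moved to another thread or task while the caller keeps its own handle, which Box<dyn Error> does not allow:

```rust
use std::error::Error;
use std::sync::Arc;
use std::thread;

// The same error value is cloned for a logging thread while the caller
// also inspects it. With Box, the value could only move to one place.
fn report_everywhere(err: Arc<dyn Error + Send + Sync>) -> String {
    let for_logger = Arc::clone(&err);
    let handle = thread::spawn(move || format!("logged: {for_logger}"));
    let logged = handle.join().unwrap();
    format!("{logged}; caller saw: {err}")
}
```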

Dynamic Dispatch

Previously, the code used static dispatch (impl S3Client) for the S3Client, meaning the underlying type was known at compile time. impl Trait is really just shorthand for generics, so making S3Client configurable meant that this generic would need to be propagated through all public-facing methods, essentially creating a breaking change.

To mitigate the above, I have changed the implementation to use dynamic dispatch (dyn S3Client). This removes the need for generics, meaning this change can be released without a major version bump.

Dynamic dispatch comes with a runtime penalty, as the concrete type is resolved at runtime. But since most S3 requests are made ahead of time (prefetch), this additional overhead should be negligible.
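A simplified sketch of the two dispatch styles; the names here are illustrative, not the crate's API:

```rust
use std::error::Error;

trait S3Client {
    fn get_object_bytes(&self, key: &str) -> Result<Vec<u8>, Box<dyn Error>>;
}

// Static dispatch: the generic leaks into every public signature.
//   fn start(client: impl S3Client) { ... }   // == fn start<T: S3Client>(client: T)

// Dynamic dispatch: one concrete signature; the method is resolved through
// a vtable at call time, so adding configurability is non-breaking.
struct Config {
    s3_client: Box<dyn S3Client>,
}

fn fetch(cfg: &Config, key: &str) -> Result<Vec<u8>, Box<dyn Error>> {
    cfg.s3_client.get_object_bytes(key)
}

// A trivial stub implementation to exercise the trait object.
struct Stub;

impl S3Client for Stub {
    fn get_object_bytes(&self, key: &str) -> Result<Vec<u8>, Box<dyn Error>> {
        Ok(key.as_bytes().to_vec())
    }
}
```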

@morgsmccauley morgsmccauley changed the title Feat/custom s3 client feat: Allow custom configuration of S3Client Apr 15, 2024
@@ -406,7 +410,7 @@ async fn start(
// We require to stream blocks consistently, so we need to try to load the block again.

let pending_block_heights = stream_block_heights(
&lake_s3_client,
&*lake_s3_client,

Dereference the Box, and pass a reference to the underlying trait object.

fn deref(&self) -> &Self::Target {
&self.0
}
}
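For context, a hedged reconstruction of the pattern this hunk relies on; the wrapper name SharedLakeS3Client and the trait method are assumed for illustration:

```rust
use std::ops::Deref;

trait S3Client {
    fn name(&self) -> &'static str;
}

// Illustrative wrapper: it holds a boxed trait object and derefs to it,
// so call sites can write `&*wrapper` to borrow the underlying client.
struct SharedLakeS3Client(Box<dyn S3Client>);

impl Deref for SharedLakeS3Client {
    type Target = Box<dyn S3Client>;

    fn deref(&self) -> &Self::Target {
        &self.0
    }
}

// A trivial implementation to show auto-deref through the wrapper.
struct Stub;

impl S3Client for Stub {
    fn name(&self) -> &'static str {
        "stub"
    }
}
```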

Essentially some helpers to make these errors easier to work with

Ok(())
}

pub fn s3_client<T: S3Client + 'static>(self, s3_client: T) -> Self {

Custom setter which allows passing an S3Client directly, rather than requiring the user to wrap it in a Box.
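A sketch of what such a setter might look like; the builder name, field, and trait definition here are hypothetical:

```rust
use std::error::Error;

trait S3Client {
    fn get_object_bytes(&self, key: &str) -> Result<Vec<u8>, Box<dyn Error>>;
}

// Hypothetical builder: the setter accepts any concrete S3Client and boxes
// it internally, so callers never write Box::new themselves.
#[derive(Default)]
struct LakeConfigBuilder {
    s3_client: Option<Box<dyn S3Client>>,
}

impl LakeConfigBuilder {
    pub fn s3_client<T: S3Client + 'static>(mut self, s3_client: T) -> Self {
        self.s3_client = Some(Box::new(s3_client));
        self
    }
}

// A no-op implementation to demonstrate the ergonomics.
struct Noop;

impl S3Client for Noop {
    fn get_object_bytes(&self, _key: &str) -> Result<Vec<u8>, Box<dyn Error>> {
        Ok(Vec::new())
    }
}
```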

},
#[error("Failed to convert integer: {error:?}")]
IntConversionError {
#[from]
error: std::num::TryFromIntError,
},
#[error("AWS Smithy byte_stream error: {error:?}")]
AwsSmithyError {

S3GetError/S3ListError now contain both AwsError and AwsSmithyError as discussed in "Type Erasure".
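An illustrative sketch, not the crate's real definitions, of how an error enum can carry the opaque per-method errors, so both AWS SDK failures and unrelated failures (e.g. Mutex locking) flow through the same variants:

```rust
use std::error::Error;
use std::fmt;
use std::sync::Arc;

// Hypothetical enum: each variant wraps an opaque, shareable error.
#[derive(Debug)]
enum LakeError {
    S3GetError { error: Arc<dyn Error + Send + Sync> },
    S3ListError { error: Arc<dyn Error + Send + Sync> },
}

impl fmt::Display for LakeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LakeError::S3GetError { error } => write!(f, "Failed to get object: {error}"),
            LakeError::S3ListError { error } => write!(f, "Failed to list objects: {error}"),
        }
    }
}
```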

@morgsmccauley morgsmccauley marked this pull request as ready for review April 15, 2024 02:01

@khorolets khorolets left a comment


Overall looks great! Thank you for improving the code base ❤️

I am going to try out these changes with some of the existing indexers to see the amount of changes required for users to adopt it (mostly to figure out how painful the error-handling part is). I will update you after that (ETA today-tomorrow).


morgsmccauley commented Apr 15, 2024

Thanks @khorolets! I already have a PR which integrates this into QueryAPI, you can take a look here.

Edit: That probably isn't that helpful actually, it's essentially the same as the implementation here. I'm working on adding caching, and can ping you once that's done if you're interested.

morgsmccauley and others added 3 commits April 15, 2024 20:23
Co-authored-by: Bohdan Khorolets <bogdan@khorolets.com>
Co-authored-by: Bohdan Khorolets <bogdan@khorolets.com>

morgsmccauley commented Apr 16, 2024

@khorolets Implementation with cache here


@khorolets khorolets left a comment


I have checked the code you've shared and tried it out, and have no objections to releasing it.

One thing that bothers me a bit is that it will break error handling for those who use the exposed methods from the library, since the errors have changed a bit.

@morgsmccauley may I ask you to bump the version and update the CHANGELOG.md? Thus I will be able to release the new version with your changes after merging this PR. Thank you!


@khorolets khorolets left a comment


I bumped the version to 0.7.8

@khorolets khorolets merged commit f440546 into 0.7.x Apr 16, 2024
5 checks passed
@khorolets khorolets deleted the feat/custom-s3-client branch April 16, 2024 12:08
morgsmccauley added a commit to near/queryapi that referenced this pull request Apr 17, 2024
Depends on near/near-lake-framework-rs#102


This PR exposes a new metric which counts the number of Get requests
made to S3 by `near-lake-framework`. I wanted to start tracking this
metric _before_ merging the change which reduces them, so I can measure
the impact of that change. The easiest way to track these requests was
to pass a custom `S3Client` to `near-lake-framework`, so we can hook
into the actual requests made.

The custom `S3Client` (`LakeS3Client`) is exactly the same as the
default implementation in `near-lake-framework` itself, but with the
added metric. This is essentially part 1 for #419, as the "reduction" in
requests will build on this custom client, adding
caching/de-duplication.

Successfully merging this pull request may close these issues.

Reduce duplicate Lake requests across dedicated streams