Perform background refresh of credentials during preempt expiry period #3541
+327
−39
Description
Adds a new method, GetTimeToLive, to the RefreshingAWSCredentials.CredentialsRefreshState class, which calculates the remaining time to live (TTL) for a credential, adjusting for the "baked-in" preempt expiry time period. Within the GetCredentials(Async) method, the TTL is used to determine which of three states the current credentials are in: valid, expired, or valid but within the preempt expiry period.
When in that last state, the current (valid, non-expired) credentials are returned and a background refresh of the credentials is attempted. If there is already an in-flight attempt (inline or background) to refresh the credentials, a new background refresh is not triggered.
When in the expired state, an inline request to generate new credentials is still triggered; however, after acquiring the mutual exclusion lock, the current credentials are re-evaluated to check whether they are still expired. This double-check helps to elide calls to GenerateNewCredentials(Async) when multiple tasks were queued to acquire the refresh credentials lock, and preserves the existing behavior, which performs the expiry check within the lock.
Motivation and Context
We have encountered an issue in our containerized HTTP API services that talk to AWS services (such as DynamoDB) while they are under load. The root cause is not an issue with the AWS SDK; however, an interesting cascading effect we have observed is "blips" in response times during an AWS credential refresh, in many cases leading to client request timeouts.
In the current implementation of the RefreshingAWSCredentials class, every call to GetCredentialsAsync attempts to obtain exclusive access by calling SemaphoreSlim.WaitAsync(). When the GenerateNewCredentialsAsync call is delayed, all calls to obtain credentials are blocked. In our service, since every incoming request makes at least one AWS service call, this effectively blocks all requests until the call completes. This in turn increases memory usage, as all task continuations are enqueued with the SemaphoreSlim. If enough of these continuations are enqueued, GC pressure mounts, with the GC consuming more CPU time but unable to remove any of the rooted contexts, ultimately resulting in a vicious cycle where the process spends most of its time in futile GC attempts. In the image below, there are over 1,000 continuations enqueued waiting for the new credentials to be generated, which consumes around 100 MB.
This PR attempts to bypass any delays (and lock contention in general) in generating new credentials by refreshing the credentials using a single background task.
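To make the three-state check from the Description concrete, here is a minimal, self-contained sketch of the idea. It is not the code in this PR and not the SDK's actual types: Creds, RefreshingSketch, and the generate delegate are hypothetical stand-ins, and the TTL is computed directly from an expiry timestamp rather than through GetTimeToLive. It only illustrates returning cached credentials inside the preempt window while a single background task refreshes them, plus the double-check after acquiring the lock.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the SDK's credential state (assumption, not the real type).
public sealed record Creds(string Token, DateTime ExpiresUtc);

public sealed class RefreshingSketch
{
    private readonly SemaphoreSlim _lock = new SemaphoreSlim(1, 1);
    private readonly Func<Task<Creds>> _generate; // stands in for GenerateNewCredentialsAsync
    private readonly TimeSpan _preempt;           // stands in for PreemptExpiryTime
    private volatile Creds? _current;
    private int _backgroundInFlight;              // 0 = idle, 1 = background refresh running

    public RefreshingSketch(Func<Task<Creds>> generate, TimeSpan preempt)
    {
        _generate = generate;
        _preempt = preempt;
    }

    public async Task<Creds> GetCredentialsAsync()
    {
        var snapshot = _current;
        if (snapshot != null)
        {
            var ttl = snapshot.ExpiresUtc - DateTime.UtcNow;
            if (ttl > _preempt) return snapshot;   // state 1: valid, no lock taken
            if (ttl > TimeSpan.Zero)               // state 2: valid, inside preempt window
            {
                TryStartBackgroundRefresh();
                return snapshot;                   // caller is not blocked
            }
        }

        // State 3: expired (or never loaded), so refresh inline under the lock.
        await _lock.WaitAsync().ConfigureAwait(false);
        try
        {
            // Double-check: a task ahead of us in the queue may already have refreshed.
            var c = _current;
            if (c == null || c.ExpiresUtc <= DateTime.UtcNow)
                _current = c = await _generate().ConfigureAwait(false);
            return c;
        }
        finally { _lock.Release(); }
    }

    private void TryStartBackgroundRefresh()
    {
        // Allow at most one background refresh at a time.
        if (Interlocked.CompareExchange(ref _backgroundInFlight, 1, 0) != 0) return;
        _ = Task.Run(async () =>
        {
            try
            {
                // Zero timeout: if an inline refresh holds the lock, skip this attempt.
                if (!await _lock.WaitAsync(0).ConfigureAwait(false)) return;
                try
                {
                    var c = _current;
                    if (c == null || c.ExpiresUtc - DateTime.UtcNow <= _preempt)
                        _current = await _generate().ConfigureAwait(false);
                }
                finally { _lock.Release(); }
            }
            catch { /* keep serving the cached credentials; a later call retries */ }
            finally { Interlocked.Exchange(ref _backgroundInFlight, 0); }
        });
    }
}

public static class Demo
{
    public static async Task Main()
    {
        int calls = 0;
        var sketch = new RefreshingSketch(async () =>
        {
            await Task.Delay(200); // simulated slow credential endpoint
            var n = Interlocked.Increment(ref calls);
            return new Creds($"token-{n}", DateTime.UtcNow.AddSeconds(2));
        }, TimeSpan.FromSeconds(1));

        var first = await sketch.GetCredentialsAsync();  // inline: nothing cached yet
        await Task.Delay(1200);                          // move into the preempt window
        var second = await sketch.GetCredentialsAsync(); // served from cache, refresh kicked off
        await Task.Delay(500);                           // give the background refresh time
        var third = await sketch.GetCredentialsAsync();
        Console.WriteLine($"{first.Token} {second.Token} {third.Token} calls={calls}");
    }
}
```

The zero-timeout WaitAsync(0) in the background path is one possible way to avoid queueing behind an inline refresh; the actual change may coordinate in-flight refreshes differently.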
Testing
We were able to reproduce the issue by using a custom implementation of AssumeRoleWithWebIdentityCredentials which allowed us to introduce a configurable amount of delay in the GenerateNewCredentialsAsync method. Additionally, we configured the PreemptExpiryTime value to be 59 minutes so that new credentials would be generated every 1 minute.
New unit tests were added to the solution to cover both existing and new functionality of the RefreshingAWSCredentials class.
Screenshots (if appropriate)
Types of changes
Checklist
License