add rate limit to asset inventory #2055
Conversation
This pull request does not have a backport label. Could you fix it @orouz? 🙏
Force-pushed c7c3406 to b83c741
This pull request is now in conflicts. Could you fix it? 🙏
Force-pushed b91d3c1 to 9502f81
📊 Allure Report - 💚 No failures were reported.
Force-pushed 68ff83e to e90006b
@@ -49,6 +49,7 @@ func (g *GCP) NewBenchmark(ctx context.Context, log *logp.Logger, cfg *config.Co
 	return builder.New(
 		builder.WithBenchmarkDataProvider(bdp),
+		builder.WithManagerTimeout(cfg.Period),
i'm not sure why we need the manager timeout at all instead of just letting it work for as long as the cycle lasts, but after this PR the GCP fetchers will be slower, going at a rate of 100 requests per minute. for 1000 requests, that's 10 minutes, which could conflict with the manager timeout's default of 10m, as the context would be cancelled. given that, i've changed the manager timeout for GCP to be the same as the CSPM cycle period, which is 24h.
It makes sense.
Once we update the rest of the cloud providers to have a rate limiter, we should consider removing the manager timeout option and configuring all of them to be limited to the interval period (24h).
I think that we should still consider a scale scenario where the resources combined with the rate limiting exceed the cycle time, and make sure that we perform the work up until the end instead of sending partial cycles. But we will still need some sort of an upper bound limit in order to make sure that we avoid "infinite cycles" (not part of this PR).
@jeniawhite I agree, but this seems to be a new feature we need to create. The upper bound limit could be (without significant effort) something like this: a cycle still running could postpone a maximum of N new cycles and then get canceled.
However, we should implement it as a new feature and consider the edge scenarios.
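To make that concrete, a minimal hypothetical sketch (names and wiring are illustrative, not part of this PR): the scheduler counts how many new cycles the in-flight cycle has postponed and cancels it once an upper bound is reached.

```go
package sketch

import "context"

// cycleGuard is an illustrative sketch only: it cancels a long-running
// cycle once that cycle has postponed maxPostponed new cycles.
type cycleGuard struct {
	maxPostponed int
	postponed    int
	cancel       context.CancelFunc // cancels the in-flight cycle; nil if none is running
}

// onTick is called whenever a new cycle is due to start. It reports
// whether the new cycle may start now.
func (g *cycleGuard) onTick() bool {
	if g.cancel == nil {
		return true // nothing running, start immediately
	}
	g.postponed++
	if g.postponed >= g.maxPostponed {
		g.cancel() // give up on the stuck cycle
		g.cancel = nil
		g.postponed = 0
		return true
	}
	return false // keep waiting for the running cycle
}
```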
good point. i've opened an issue to handle this scenario - #2180
@@ -405,19 +400,21 @@ func getAssetsByProject[T any](assets []*ExtendedGcpAsset, log *logp.Logger, f T
 	return enrichedAssets
 }

-func getAllAssets(log *logp.Logger, it Iterator) []*assetpb.Asset {
+func (p *Provider) getAllAssets(ctx context.Context, request *assetpb.ListAssetsRequest) []*assetpb.Asset {
i've changed this function to be a method on the provider so we can use it for all fetching and log the request details we're about to make in one place, instead of sprinkling a bunch of p.log.Infof(...) calls every time we call getAllAssets.
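Roughly, the new shape looks like this (a hedged sketch - the Provider fields are stand-ins for the real ones, and the getAssetsFromIterator helper is hypothetical; its pagination loop is sketched a bit further down):

```go
package sketch

import (
	"context"

	asset "cloud.google.com/go/asset/apiv1"
	"cloud.google.com/go/asset/apiv1/assetpb"
	"github.com/elastic/elastic-agent-libs/logp"
)

// minimal stand-ins for the real Provider fields (illustrative only)
type Provider struct {
	log       *logp.Logger
	inventory *asset.Client
}

func (p *Provider) getAllAssets(ctx context.Context, request *assetpb.ListAssetsRequest) []*assetpb.Asset {
	// log the request details once here, instead of at every call site
	p.log.Infof("Fetching GCP assets: parent=%s, content_type=%s, asset_types=%v",
		request.Parent, request.ContentType, request.AssetTypes)
	it := p.inventory.ListAssets(ctx, request)
	return getAssetsFromIterator(p.log, it) // hypothetical helper, sketched further down
}
```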
-	log.Errorf("Error fetching GCP Asset: %s", err)
-	return nil
+	p.log.Errorf("Error fetching GCP Asset: %s", err)
+	return results
this isn't part of the rate limiting bug fix, but still a fix - we used to return nil whenever we got an error, which comes from a request to the next page. but if, for example, we got page 1 and already populated results with data, then got an error on page 2, we still returned nil instead of the results we already got. so now we return what we already have after getting an error.
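A hedged sketch of that loop (assuming the iterator API from cloud.google.com/go/asset/apiv1; the real code may differ):

```go
package sketch

import (
	"errors"

	asset "cloud.google.com/go/asset/apiv1"
	"cloud.google.com/go/asset/apiv1/assetpb"
	"github.com/elastic/elastic-agent-libs/logp"
	"google.golang.org/api/iterator"
)

// getAssetsFromIterator drains the iterator page by page. On an error it
// returns whatever was already collected instead of nil.
func getAssetsFromIterator(log *logp.Logger, it *asset.AssetIterator) []*assetpb.Asset {
	results := make([]*assetpb.Asset, 0)
	for {
		a, err := it.Next()
		if errors.Is(err, iterator.Done) {
			break
		}
		if err != nil {
			log.Errorf("Error fetching GCP Asset: %s", err)
			return results // partial results from earlier pages, not nil
		}
		results = append(results, a)
	}
	return results
}
```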
That's great, and I can see that eventually this is being appended into assets. We just need to make sure that there is no hidden logic that differentiates between nil and actually getting results, because we do not want to act as if everything succeeded when we have a partial response.
From what I saw in the code we act in an opportunistic sort of way, but I would consider, in the future, recognizing when we have partial results and potentially acting on it instead of skipping the whole cycle (not in the scope of this PR).
we still log the error like we did before, and the cycle continues without interruption like before, it just has a bit more findings to report. overall i think this change is safe, as returning nil or an empty []*assetpb.Asset is the same in the sense that we never differentiate between the two, we just don't iterate on an empty value.
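A tiny self-contained illustration of that point - ranging over a nil slice and over an empty slice behaves the same in Go, so callers that only iterate can't tell the difference:

```go
package main

import "fmt"

func main() {
	var nilAssets []string    // nil slice
	emptyAssets := []string{} // empty, non-nil slice

	// neither loop body runs
	for range nilAssets {
		fmt.Println("never printed")
	}
	for range emptyAssets {
		fmt.Println("never printed")
	}
	fmt.Println(len(nilAssets), len(emptyAssets)) // 0 0
}
```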
func getAncestorsAssets(ctx context.Context, ancestorsPolicies map[string][]*ExtendedGcpAsset, p *Provider, ancestors []string) []*ExtendedGcpAsset {
	return lo.Flatten(lo.Map(ancestors, func(parent string, _ int) []*ExtendedGcpAsset {
		if ancestorsPolicies[parent] != nil {
			return ancestorsPolicies[parent]
		}
this function is called during an iteration on a user's projects and it fetches policies for each project ancestors. i've added a cache because ancestors are prone to be identical between different projects. (for example, the last ancestor item - organizations/123
will always be the same). i've tested this locally and got a lot of cache hits for organization / folders. this will reduce the number of api calls the policies fetcher is making.
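The caching pattern, sketched generically (the fetch function and types are placeholders for the real policy fetching, just to show where the cache hits happen):

```go
package main

import "fmt"

// fetchPerAncestor returns the values for each ancestor, fetching each
// distinct ancestor only once. fetch stands in for the real API call.
func fetchPerAncestor[T any](ancestors []string, cache map[string][]T, fetch func(parent string) []T) []T {
	var out []T
	for _, parent := range ancestors {
		if cached, ok := cache[parent]; ok {
			out = append(out, cached...) // cache hit - e.g. organizations/123 repeats for every project
			continue
		}
		fetched := fetch(parent)
		cache[parent] = fetched
		out = append(out, fetched...)
	}
	return out
}

func main() {
	cache := map[string][]string{}
	calls := 0
	fetch := func(parent string) []string {
		calls++
		return []string{"policy-for-" + parent}
	}
	// two projects sharing the same folder and organization ancestors
	fetchPerAncestor([]string{"projects/a", "folders/9", "organizations/123"}, cache, fetch)
	fetchPerAncestor([]string{"projects/b", "folders/9", "organizations/123"}, cache, fetch)
	fmt.Println("api calls:", calls) // 4 instead of 6
}
```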
// a map of asset inventory client methods and their quotas.
// see https://cloud.google.com/asset-inventory/docs/quota
var methods = map[string]*rate.Limiter{
if at some point we'll use more methods from the assets inventory, we can add them here.
it might be better to add the rate limiting directly to the ListAssets method instead of adding it to the whole assets inventory client and only limiting the methods we pre-define, but i didn't find a way to do this (the grpc.CallOption interface does not export relevant types).
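For context, a hedged sketch of how a per-method limiter map can be enforced through a gRPC unary client interceptor (using golang.org/x/time/rate; the method name string and the exact wiring are assumptions, the real PR code may differ). Methods not present in the map just pass through:

```go
package sketch

import (
	"context"
	"time"

	"golang.org/x/time/rate"
	"google.golang.org/grpc"
)

// per-method quotas; see https://cloud.google.com/asset-inventory/docs/quota
var methods = map[string]*rate.Limiter{
	// assumed full gRPC method name; ~100 requests per minute per consumer project
	"/google.cloud.asset.v1.AssetService/ListAssets": rate.NewLimiter(rate.Every(time.Minute/100), 1),
}

// rateLimitInterceptor waits on the limiter for known methods and is a
// plain pass-through for every other method.
func rateLimitInterceptor() grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply any, cc *grpc.ClientConn,
		invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		if limiter, ok := methods[method]; ok {
			// blocks until a token is available or the context is cancelled
			if err := limiter.Wait(ctx); err != nil {
				return err
			}
		}
		return invoker(ctx, method, req, reply, cc, opts...)
	}
}
```

The interceptor would then be attached when creating the inventory client, e.g. via option.WithGRPCDialOption(grpc.WithChainUnaryInterceptor(rateLimitInterceptor())) - again an assumption about the wiring, not necessarily how this PR does it.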
so what happens in case we call a method that is not ListAssets? is the interceptor still active?
the interceptor will be called but we'll not wait, right?
the interceptor will be called but we'll not wait, right?
yeah the interceptor will just be a pass-through function
Force-pushed 85e60a8 to 7625e87
Looks great! 💪
Left some questions...
Force-pushed 7625e87 to 8545a5e
we already have a cache utility, but it is used for single values and is cycle-aware. this cache just abstracts the repetitive read/write a plain map would require, and instead takes a function to get a value, which will be used for the initial read and assignment.
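A hedged sketch of the shape of such a helper (the real MapCache signature in this PR may differ):

```go
package sketch

// MapCache memoizes values by key; on a miss it calls compute and stores
// the result. Illustrative sketch only, not the PR's exact code.
type MapCache[T any] struct {
	m map[string]T
}

func NewMapCache[T any]() *MapCache[T] {
	return &MapCache[T]{m: make(map[string]T)}
}

// Get returns the cached value for key, computing and storing it first
// if it is not present.
func (c *MapCache[T]) Get(compute func() T, key string) T {
	if v, ok := c.m[key]; ok {
		return v
	}
	v := compute()
	c.m[key] = v
	return v
}
```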
Force-pushed 8545a5e to 0f7346d
Force-pushed 0f7346d to f1aa8f7
Force-pushed f1aa8f7 to 699424b
-	projectName = crm.getProjectDisplayName(ctx, keys.parentProject)
+	// some assets are not associated with a project
+	if projectId != "" {
+		projectName = p.crm.getProjectDisplayName(ctx, fmt.Sprintf("projects/%s", projectId))
Notice that this changes the behavior.
Previously we returned an empty string and printed a log.
Now we do not print any log and we do not manipulate the projectName value (which should be an empty string due to the initial declaration).
Another side effect is that we do not push this value and key to the cache (wondering if that affects any of the flows).
this does change the behavior we had before, but i think it's ok. we used to try to fetch project names using a project id of an empty string, which resulted in an empty project name. we did this multiple times, so sometimes we got the empty project name from cache. in any case, after #2085 was merged, we don't send empty project names anyway:
insertIfNotEmpty(cloudAccountNameField, strings.FirstNonEmpty(resMetadata.AccountName, a.accountName), event),
so the outcome of this change is ultimately just not sending redundant api calls to fetch empty project names.
	log: log,
	inventory: assetsInventoryWrapper,
	crm: crmServiceWrapper,
	cloudAccountMetadataCache: NewMapCache[*fetching.CloudAccountMetadata](),
Is the Provider lifecycle per-cycle or long-lived? I think it is long-lived - it's initialized once at the startup of cloudbeat - but perhaps I am mistaken here.
If it is, then shouldn't we be extra careful about what we cache once and only once? Is cloud account metadata safe to retrieve only once per lifetime? If not, we should use one of the many in-mem cache libraries that implement a global TTL or a per-key TTL.
(Alternatively, a cache-per-cycle could also be a safe choice.)
before this PR, the cache was a plain map:
- crmCache map[string]*fetching.CloudAccountMetadata
+ cloudAccountMetadataCache *MapCache[*fetching.CloudAccountMetadata]
so it's still the same behavior as before, only now a little less repetitive as MapCache
just takes away the operations we'd need to do to cache values using a plain map[string]
in general though, i agree this behaviour is probably not correct, even though project/org names probably rarely change, we'd probably still want to get fresh values. i've opened an issue to address this in a separate PR (as the behaviour itself didn't change from before and is unrelated to this PR)
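Purely as a hypothetical follow-up shape (not part of this PR or the linked issue), the same map-style cache could carry a per-key TTL so cached names eventually get refreshed:

```go
package sketch

import "time"

// ttlEntry pairs a cached value with its expiry time.
type ttlEntry[T any] struct {
	value   T
	expires time.Time
}

// TTLMapCache is like MapCache but recomputes a value once it expires.
type TTLMapCache[T any] struct {
	m   map[string]ttlEntry[T]
	ttl time.Duration
}

func NewTTLMapCache[T any](ttl time.Duration) *TTLMapCache[T] {
	return &TTLMapCache[T]{m: make(map[string]ttlEntry[T]), ttl: ttl}
}

func (c *TTLMapCache[T]) Get(compute func() T, key string) T {
	if e, ok := c.m[key]; ok && time.Now().Before(e.expires) {
		return e.value // still fresh
	}
	v := compute()
	c.m[key] = ttlEntry[T]{value: v, expires: time.Now().Add(c.ttl)}
	return v
}
```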
Force-pushed 699424b to 078d7c6
(cherry picked from commit b8ffed9)
Summary of your changes
- added a rate limiter for the asset inventory ListAssets calls
- retry ListAssets client calls whenever they fail due to rate limiting
- for ListAssets, we only use the per-project quota, which is 100 per minute per consumer project. we do this because:
  - users run gcloud config set project <project_id> before deploying the agent (verified with gcloud config get billing/quota_project), and for both single-account and organization-account we always use a single quota project. we never re-define it for the user.
  - the per-organization quotas are 800 per minute per org and 650,000 per day per org, so the per-project quota is more restrictive than both of these, meaning we shouldn't exceed those either.

Screenshot/Data
test script
- to test ListAssets against a different organization, set the Parent param to that organization. make sure to re-run gcloud auth login and gcloud auth application-default login.
- the test output ends with requests got: 1000, which means it doesn't lose any requests.
- logs around rl.Wait() and RetryOnResourceExhausted can be used to verify the rate limiting and retries kick in.

Related Issues