add rate limit to asset inventory #2055
Conversation
This pull request does not have a backport label. Could you fix it @orouz? 🙏
Force-pushed c7c3406 to b83c741
This pull request is now in conflicts. Could you fix it? 🙏
Force-pushed b91d3c1 to 9502f81
📊 Allure Report - 💚 No failures were reported.
Force-pushed 68ff83e to e90006b
@@ -49,6 +49,7 @@ func (g *GCP) NewBenchmark(ctx context.Context, log *logp.Logger, cfg *config.Co
 	return builder.New(
 		builder.WithBenchmarkDataProvider(bdp),
+		builder.WithManagerTimeout(cfg.Period),
i'm not sure why we need the manager timeout at all instead of just letting it work for as long as the cycle lasts, but after this PR the GCP fetchers will be slower, going at a rate of 100 requests per minute. for 1000 requests, that's 10 minutes, which could conflict with the manager timeout's default of 10m, as the context would be cancelled. given that, i've changed the manager timeout for GCP to be the same as the CSPM cycle period, which is 24h.
It makes sense.
Once we update the rest of the cloud providers to have a rate limiter, we should consider removing the manager timeout option and configuring all of them to be limited to the interval period (24h).
I think that we should still consider a scale scenario where the resources combined with the rate limiting exceed the cycle time, and make sure that we perform the work up until the end instead of sending partial cycles. But we will still need some sort of an upper bound limit in order to make sure that we avoid "infinite cycles" (not part of this PR).
@jeniawhite I agree, but this seems to be a new feature we need to create. The upper bound limit could be (without significant effort) something like this: a cycle still running could postpone a maximum of N new cycles and then get canceled.
However, we should implement it as a new feature and consider the edge scenarios.
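To make that concrete, a minimal hypothetical sketch (names and wiring are illustrative, not part of this PR): the scheduler counts how many new cycles the in-flight cycle has postponed and cancels it once an upper bound is reached.

```go
package sketch

import "context"

// cycleGuard is an illustrative sketch only: it cancels a long-running
// cycle once that cycle has postponed maxPostponed new cycles.
type cycleGuard struct {
	maxPostponed int
	postponed    int
	cancel       context.CancelFunc // cancels the in-flight cycle; nil if none is running
}

// onTick is called whenever a new cycle is due to start. It reports
// whether the new cycle may start now.
func (g *cycleGuard) onTick() bool {
	if g.cancel == nil {
		return true // nothing running, start immediately
	}
	g.postponed++
	if g.postponed >= g.maxPostponed {
		g.cancel() // give up on the stuck cycle
		g.cancel = nil
		g.postponed = 0
		return true
	}
	return false // keep waiting for the running cycle
}
```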
good point. i've opened an issue to handle this scenario - #2180
@@ -405,19 +400,21 @@ func getAssetsByProject[T any](assets []*ExtendedGcpAsset, log *logp.Logger, f T
 	return enrichedAssets
 }

-func getAllAssets(log *logp.Logger, it Iterator) []*assetpb.Asset {
+func (p *Provider) getAllAssets(ctx context.Context, request *assetpb.ListAssetsRequest) []*assetpb.Asset {
i've changed this function to be a method on the provider so we can use it for all fetching and log the request details we're about to make in one place, instead of sprinkling a bunch of p.log.Infof(...) calls every time we call getAllAssets.
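Roughly, the new shape looks like this (a hedged sketch - the Provider fields are stand-ins for the real ones, and the getAssetsFromIterator helper is hypothetical; its pagination loop is sketched a bit further down):

```go
package sketch

import (
	"context"

	asset "cloud.google.com/go/asset/apiv1"
	"cloud.google.com/go/asset/apiv1/assetpb"
	"github.com/elastic/elastic-agent-libs/logp"
)

// minimal stand-ins for the real Provider fields (illustrative only)
type Provider struct {
	log       *logp.Logger
	inventory *asset.Client
}

func (p *Provider) getAllAssets(ctx context.Context, request *assetpb.ListAssetsRequest) []*assetpb.Asset {
	// log the request details once here, instead of at every call site
	p.log.Infof("Fetching GCP assets: parent=%s, content_type=%s, asset_types=%v",
		request.Parent, request.ContentType, request.AssetTypes)
	it := p.inventory.ListAssets(ctx, request)
	return getAssetsFromIterator(p.log, it) // hypothetical helper, sketched further down
}
```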
-	log.Errorf("Error fetching GCP Asset: %s", err)
-	return nil
+	p.log.Errorf("Error fetching GCP Asset: %s", err)
+	return results
this isn't part of the rate limiting bug fix, but still a fix - we used to return nil whenever we got an error, which comes from a request to the next page. but if, for example, we got page 1 and already populated results with data, then got an error on page 2, we still returned nil instead of the results we already got. so now we return what we already have after getting an error.
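A hedged sketch of that loop (assuming the iterator API from cloud.google.com/go/asset/apiv1; the real code may differ):

```go
package sketch

import (
	"errors"

	asset "cloud.google.com/go/asset/apiv1"
	"cloud.google.com/go/asset/apiv1/assetpb"
	"github.com/elastic/elastic-agent-libs/logp"
	"google.golang.org/api/iterator"
)

// getAssetsFromIterator drains the iterator page by page. On an error it
// returns whatever was already collected instead of nil.
func getAssetsFromIterator(log *logp.Logger, it *asset.AssetIterator) []*assetpb.Asset {
	results := make([]*assetpb.Asset, 0)
	for {
		a, err := it.Next()
		if errors.Is(err, iterator.Done) {
			break
		}
		if err != nil {
			log.Errorf("Error fetching GCP Asset: %s", err)
			return results // partial results from earlier pages, not nil
		}
		results = append(results, a)
	}
	return results
}
```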
That's great, and I can see that eventually this is being appended into assets. We just need to make sure that there is no hidden logic that differentiates between nil and actually getting results, because we do not want to act as if everything succeeded when we have a partial response.
From what I saw in the code we act in an opportunistic sort of way, but I would consider, in the future, recognizing when we have partial results and potentially acting on it instead of skipping the whole cycle (not in the scope of this PR).
we still log the error like we did before, and the cycle continues without interruption like before, it just has a bit more findings to report. overall i think this change is safe, as returning nil or an empty []*assetpb.Asset is the same in the sense that we never differentiate between the two, we just don't iterate on an empty value.
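A tiny self-contained illustration of that point - ranging over a nil slice and over an empty slice behaves the same in Go, so callers that only iterate can't tell the difference:

```go
package main

import "fmt"

func main() {
	var nilAssets []string    // nil slice
	emptyAssets := []string{} // empty, non-nil slice

	// neither loop body runs
	for range nilAssets {
		fmt.Println("never printed")
	}
	for range emptyAssets {
		fmt.Println("never printed")
	}
	fmt.Println(len(nilAssets), len(emptyAssets)) // 0 0
}
```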
func getAncestorsAssets(ctx context.Context, ancestorsPolicies map[string][]*ExtendedGcpAsset, p *Provider, ancestors []string) []*ExtendedGcpAsset {
	return lo.Flatten(lo.Map(ancestors, func(parent string, _ int) []*ExtendedGcpAsset {
		if ancestorsPolicies[parent] != nil {
			return ancestorsPolicies[parent]
		}
this function is called during an iteration on a user's projects and it fetches policies for each project ancestors. i've added a cache because ancestors are prone to be identical between different projects. (for example, the last ancestor item - organizations/123
will always be the same). i've tested this locally and got a lot of cache hits for organization / folders. this will reduce the number of api calls the policies fetcher is making.
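The caching pattern, sketched generically (the fetch function and types are placeholders for the real policy fetching, just to show where the cache hits happen):

```go
package main

import "fmt"

// fetchPerAncestor returns the values for each ancestor, fetching each
// distinct ancestor only once. fetch stands in for the real API call.
func fetchPerAncestor[T any](ancestors []string, cache map[string][]T, fetch func(parent string) []T) []T {
	var out []T
	for _, parent := range ancestors {
		if cached, ok := cache[parent]; ok {
			out = append(out, cached...) // cache hit - e.g. organizations/123 repeats for every project
			continue
		}
		fetched := fetch(parent)
		cache[parent] = fetched
		out = append(out, fetched...)
	}
	return out
}

func main() {
	cache := map[string][]string{}
	calls := 0
	fetch := func(parent string) []string {
		calls++
		return []string{"policy-for-" + parent}
	}
	// two projects sharing the same folder and organization ancestors
	fetchPerAncestor([]string{"projects/a", "folders/9", "organizations/123"}, cache, fetch)
	fetchPerAncestor([]string{"projects/b", "folders/9", "organizations/123"}, cache, fetch)
	fmt.Println("api calls:", calls) // 4 instead of 6
}
```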
// a map of asset inventory client methods and their quotas.
// see https://cloud.google.com/asset-inventory/docs/quota
var methods = map[string]*rate.Limiter{
if at some point we'll use more methods from the assets inventory, we can add them here.
it might be better to add the rate limiting directly to the ListAssets method instead of adding it to the whole assets inventory client and only limiting the methods we pre-define, but i didn't find a way to do this (the grpc.CallOption interface does not export relevant types).
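For context, a hedged sketch of how a per-method limiter map can be enforced through a gRPC unary client interceptor (using golang.org/x/time/rate; the method name string and the exact wiring are assumptions, the real PR code may differ). Methods not present in the map just pass through:

```go
package sketch

import (
	"context"
	"time"

	"golang.org/x/time/rate"
	"google.golang.org/grpc"
)

// per-method quotas; see https://cloud.google.com/asset-inventory/docs/quota
var methods = map[string]*rate.Limiter{
	// assumed full gRPC method name; ~100 requests per minute per consumer project
	"/google.cloud.asset.v1.AssetService/ListAssets": rate.NewLimiter(rate.Every(time.Minute/100), 1),
}

// rateLimitInterceptor waits on the limiter for known methods and is a
// plain pass-through for every other method.
func rateLimitInterceptor() grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply any, cc *grpc.ClientConn,
		invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		if limiter, ok := methods[method]; ok {
			// blocks until a token is available or the context is cancelled
			if err := limiter.Wait(ctx); err != nil {
				return err
			}
		}
		return invoker(ctx, method, req, reply, cc, opts...)
	}
}
```

The interceptor would then be attached when creating the inventory client, e.g. via option.WithGRPCDialOption(grpc.WithChainUnaryInterceptor(rateLimitInterceptor())) - again an assumption about the wiring, not necessarily how this PR does it.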
so what happens in case we call a method that is not ListAssets? is the interceptor still active?
the interceptor will be called but we'll not wait, right?
the interceptor will be called but we'll not wait, right?
yeah the interceptor will just be a pass-through function
Force-pushed 85e60a8 to 7625e87
Looks great! 💪
Left some questions...
Force-pushed 7625e87 to 8545a5e
we already have a cache utility, but it is used for single values and is cycle-aware. this cache just abstracts the repetitive read/write a plain map would require, and instead takes a function to get a value, which will be used for the initial read and assignment.
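A hedged sketch of the shape of such a helper (the real MapCache signature in this PR may differ):

```go
package sketch

// MapCache memoizes values by key; on a miss it calls compute and stores
// the result. Illustrative sketch only, not the PR's exact code.
type MapCache[T any] struct {
	m map[string]T
}

func NewMapCache[T any]() *MapCache[T] {
	return &MapCache[T]{m: make(map[string]T)}
}

// Get returns the cached value for key, computing and storing it first
// if it is not present.
func (c *MapCache[T]) Get(compute func() T, key string) T {
	if v, ok := c.m[key]; ok {
		return v
	}
	v := compute()
	c.m[key] = v
	return v
}
```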
Force-pushed 8545a5e to 0f7346d
Force-pushed 0f7346d to f1aa8f7
Force-pushed f1aa8f7 to 699424b
-	projectName = crm.getProjectDisplayName(ctx, keys.parentProject)
+	// some assets are not associated with a project
+	if projectId != "" {
+		projectName = p.crm.getProjectDisplayName(ctx, fmt.Sprintf("projects/%s", projectId))
Notice that this changes the behavior.
Previously we returned an empty string and printed a log.
Now we do not print any log and we do not manipulate the projectName value (which should be an empty string due to the initial declaration).
Another side effect is that we do not push this value and key to the cache (wondering if that affects any of the flows).
this does change the behavior we had before, but i think it's ok. we used to try to fetch project names using a project id of an empty string, which resulted in an empty project name. we did this multiple times, so sometimes we got the empty project name from cache. in any case, after #2085 was merged, we don't send empty project names anyway:
insertIfNotEmpty(cloudAccountNameField, strings.FirstNonEmpty(resMetadata.AccountName, a.accountName), event),
so the outcome of this change is ultimately just not sending redundant api calls to fetch empty project names.
	log: log,
	inventory: assetsInventoryWrapper,
	crm: crmServiceWrapper,
	cloudAccountMetadataCache: NewMapCache[*fetching.CloudAccountMetadata](),
Is the Provider lifecycle per-cycle or long-lived? I think it is long-lived - it's initialized once at the startup of cloudbeat - but perhaps I am mistaken here.
If it is, then shouldn't we be extra careful about what we cache once and only once? Is cloud account metadata safe to retrieve only once per lifetime? If not, we should use one of the many in-mem cache libraries that implement a global TTL or a per-key TTL.
(Alternatively, a cache-per-cycle could also be a safe choice.)
before this PR, the cache was a plain map:
- crmCache map[string]*fetching.CloudAccountMetadata
+ cloudAccountMetadataCache *MapCache[*fetching.CloudAccountMetadata]
so it's still the same behavior as before, only now a little less repetitive as MapCache
just takes away the operations we'd need to do to cache values using a plain map[string]
in general though, i agree this behaviour is probably not correct, even though project/org names probably rarely change, we'd probably still want to get fresh values. i've opened an issue to address this in a separate PR (as the behaviour itself didn't change from before and is unrelated to this PR)
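Purely as a hypothetical follow-up shape (not part of this PR or the linked issue), the same map-style cache could carry a per-key TTL so cached names eventually get refreshed:

```go
package sketch

import "time"

// ttlEntry pairs a cached value with its expiry time.
type ttlEntry[T any] struct {
	value   T
	expires time.Time
}

// TTLMapCache is like MapCache but recomputes a value once it expires.
type TTLMapCache[T any] struct {
	m   map[string]ttlEntry[T]
	ttl time.Duration
}

func NewTTLMapCache[T any](ttl time.Duration) *TTLMapCache[T] {
	return &TTLMapCache[T]{m: make(map[string]ttlEntry[T]), ttl: ttl}
}

func (c *TTLMapCache[T]) Get(compute func() T, key string) T {
	if e, ok := c.m[key]; ok && time.Now().Before(e.expires) {
		return e.value // still fresh
	}
	v := compute()
	c.m[key] = ttlEntry[T]{value: v, expires: time.Now().Add(c.ttl)}
	return v
}
```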
Force-pushed 699424b to 078d7c6
(cherry picked from commit b8ffed9)
Summary of your changes
- added a rate limiter for the asset inventory ListAssets calls
- retry ListAssets client calls whenever they fail due to rate limiting
- for ListAssets, we only use the per-project quota, which is 100 per minute per consumer project. we do this because:
  - users run gcloud config set project <project_id> before deploying the agent (verified with gcloud config get billing/quota_project), and for both single-account and organization-account we always use a single quota project. we never re-define it for the user.
  - the per-organization quotas are 800 per minute per org and 650,000 per day per org, so the per-project quota is more restrictive than both of these, meaning we shouldn't exceed those either.

Screenshot/Data
test script
- to test ListAssets against a different organization, set the Parent param to that organization. make sure to re-run gcloud auth login and gcloud auth application-default login.
- the test output ends with requests got: 1000, which means it doesn't lose any requests.
- logs around rl.Wait() and RetryOnResourceExhausted can be used to verify the rate limiting and retries kick in.

Related Issues