[DRAFT PR] POC code to batch async shards fetch per node #7269
My understanding of the flow:
We're making two calls:
- first, shards_started: to know which nodes a particular shard is present on
- second, shard metadata: to get the metadata of the shard files

Here we're calling both APIs at the shard level. Since we need to do this asynchronously, we have fetchers stored in a concurrent map, one per shard. One fetcher is responsible for fetching the data for one shard, which is why the current maps are keyed by shardId.
Now we want to move this fetching to the node level to support batching. For that, you are adding a node-level fetcher (AsyncShardsFetchPerNode).
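The keying change described above can be sketched with a simplified, hypothetical example (the real classes are `AsyncShardFetch` and `AsyncShardsFetchPerNode`; everything here is a stand-in to illustrate only the map structure):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FetcherKeying {
    // Current design: one fetcher per shard, so the map is keyed by shardId.
    static final Map<String, Object> fetchersByShardId = new ConcurrentHashMap<>();

    // Proposed design: one batched fetcher per node, keyed by nodeId, where
    // each fetcher covers all shards that need data from that node.
    static final Map<String, List<String>> shardBatchByNodeId = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        fetchersByShardId.put("[index][0]", new Object());
        fetchersByShardId.put("[index][1]", new Object());
        shardBatchByNodeId.put("node-1", List.of("[index][0]", "[index][1]"));
        // Per-shard: as many fetchers as shards; per-node: one batch per node.
        System.out.println(fetchersByShardId.size() + " shard fetchers vs "
            + shardBatchByNodeId.size() + " node batch(es)");
    }
}
```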
Comments/questions on strategy:
- We should change the strategy for both of the calls, not just one.
- Without changing ReplicaShardAllocator, how are we ensuring that the metadata fetch is done at the node level?
- AllocationService calls allocateUnassigned in a for loop for every shard, which calls GatewayAllocator, and ultimately PSA's makeAllocationDecision. I want to understand how this flow becomes per-node without changing AllocationService at all.
Comments on code:
- Please don't copy-paste a whole class, even for a draft PR; reuse existing code and move it to a common place so it's easy to see exactly what changed.
- Make holistic changes so the Gradle check doesn't fail; if the change breaks any existing behavior, please fix that first.
- Please try to convey the impact of the change: how do we benefit from a reduced number of transport calls while moving almost the same amount of data?
- Let's try not to change the order of the flow: whatever happens after allocation should keep happening in that order, and the same for before allocation.
```java
     * Fills the shard fetched data with new (data) nodes and a fresh NodeEntry, and removes from
     * it nodes that are no longer part of the state.
     */
    private void fillShardCacheWithDataNodes(Map<String, AsyncShardsFetchPerNode.NodeEntry<T>> shardCache, DiscoveryNodes nodes) {
```
Is the cache still at the node level?
I'm assuming the cache object exists per NodeLevelFetcher, so it should contain data for every shard; in that case, shardId should be the new key.
How are we keeping the map structure the same when the whole class is new and does node-level fetching?
Thanks @Gaurav614 for taking up this change. Comments on the strategy and overall code changes (not reviewing the code, just the strategy):
```java
     * An action that lists the relevant shard data that needs to be fetched.
     */
    public interface Lister<NodesResponse extends BaseNodesResponse<NodeResponse>, NodeResponse extends BaseNodeResponse> {
        void list(DiscoveryNode[] nodes, Map<ShardId, String> shardsIdMap, ActionListener<NodesResponse> listener);
```
Shouldn't this take a single node and a list/map of shards?
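The reviewer's suggested shape could look roughly like the following hedged sketch. The types are simplified stand-ins (String for DiscoveryNode/ShardId, Consumer for ActionListener), not the actual OpenSearch API:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class NodeLevelLister {
    // Node-level variant: one node plus the batch of shards to fetch from it,
    // instead of an array of nodes for a single shard.
    interface Lister<R> {
        void list(String node, Map<String, String> shardIdToCustomDataPath, Consumer<R> listener);
    }

    public static void main(String[] args) {
        Lister<List<String>> lister = (node, shards, listener) ->
            // one transport call to `node`, covering every shard in the batch
            listener.accept(List.copyOf(shards.keySet()));

        lister.list("node-1", Map.of("shard-0", "", "shard-1", ""),
            fetched -> System.out.println("fetched " + fetched.size() + " shard entries from one node"));
    }
}
```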
```java
     * A node entry, holding the state of the fetched data for a specific shard
     * for a given node.
     */
    static class NodeEntry<T> {
```
Shouldn't this be a ShardEntry class now that the top-level map is keyed by nodeId?
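The suggested rename amounts to something like this hedged sketch: with the outer cache keyed by nodeId, each inner entry tracks per-shard fetch state, so `ShardEntry` reads more naturally. All field and type names here are illustrative only:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ShardEntrySketch {
    // Per-shard fetch state held under a node's cache entry.
    static class ShardEntry<T> {
        private T value;
        private Throwable failure;
        private boolean fetching = true;

        void doneFetching(T value) { this.value = value; this.fetching = false; }
        void setFailure(Throwable t) { this.failure = t; this.fetching = false; }
        boolean fetching() { return fetching; }
        T value() { return value; }
        Throwable failure() { return failure; }
    }

    // nodeId -> (shardId -> per-shard fetch state)
    static final Map<String, Map<String, ShardEntry<String>>> cache = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        ShardEntry<String> entry = new ShardEntry<>();
        entry.doneFetching("shard metadata");
        cache.computeIfAbsent("node-1", n -> new ConcurrentHashMap<>()).put("shard-0", entry);
        System.out.println(cache.get("node-1").get("shard-0").value());
    }
}
```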
```java
    }

    @Override
    public void beforeAllocation(final RoutingAllocation allocation) {
        assert primaryShardAllocator != null;
        assert replicaShardAllocator != null;
        ensureAsyncFetchStorePrimaryRecency(allocation);

        // build the view of shards per node here by doing transport calls on nodes and populate shardsPerNode
        collectShardsPerNode(allocation);
```
beforeAllocation should be lightweight. This should happen during allocation logic execution.
@shwetathareja @amkhar Thanks a ton for taking the time to review and provide valuable insights.
@amkhar We can't do this. One call is used for primary shard allocation and the other in RSA. Currently the algorithm first assigns primaries and then replicas.
@amkhar The code changes are not for RSA right now; the focus is on PSA. We will add a new transport action, like the one we added for the data PSA needs.
So the idea is: AllocationService calls beforeAllocation() once, and I am using that hook to create the view (the data PSA needs). This way, when AllocationService calls GatewayAllocator to assign the shards, we already have the data collected (if not, we just bail out of that allocation), and when PSA calls fetchData to make the allocation decision, we return the data from our view, adapted to what PSA currently understands. And as @shwetathareja noted, visual diagrams need to be added for better understanding; I am working on them and will add them soon.
Do you have any idea how to convey that? The general observation has been that in large clusters with a high number of shards, the transport layer gets choked by the async fetches. With this change there will be less stress on transport.
Nothing is changing after allocation; we use that phase to clear our caches and data structures, just as the existing code does. In beforeAllocation we add code to populate the view in the GatewayAllocator. Since AllocationService calls GatewayAllocator for all unassigned shards in a loop, we don't want to populate the view (collectShardsPerNode) on every call; that's why it's better to build the view before allocation starts (@shwetathareja, that answers your inline comment).
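The populate-once, read-many flow described above can be sketched as follows. The method names mirror the PR's wording (`collectShardsPerNode`, `shardsPerNode`), but the bodies are illustrative stand-ins, not the actual implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BatchedViewFlow {
    // nodeId -> fetched shard metadata blob (stand-in type)
    static final Map<String, String> shardsPerNode = new ConcurrentHashMap<>();

    // Called once per allocation round (in beforeAllocation), not once per shard.
    static void collectShardsPerNode() {
        shardsPerNode.put("node-1", "metadata for all shards on node-1");
    }

    // Called per unassigned shard by the allocator; a cheap map lookup only,
    // bailing out when the batched data has not arrived yet.
    static String makeAllocationDecision(String nodeId) {
        String data = shardsPerNode.get(nodeId);
        return data == null ? "BAIL_OUT" : "DECIDE_WITH:" + data;
    }

    public static void main(String[] args) {
        System.out.println(makeAllocationDecision("node-1")); // view not built yet
        collectShardsPerNode();
        System.out.println(makeAllocationDecision("node-1")); // uses the batched view
    }
}
```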
@shwetathareja Tried to capture the details in the flow diagram; updated the description.
This PR is stalled because it has been open for 30 days with no activity. Remove stalled label or comment or this will be closed in 7 days.
This PR was closed because it has been stalled for 7 days with no activity.
Description

This is a draft PR to discuss the idea of batching the `AsyncShardFetch` used in `GatewayAllocator`.

In the current flow, `PrimaryShardAllocator`/`ReplicaShardAllocator` (PSA/RSA) call `GatewayAllocator`, which makes transport calls per shard to the N data nodes to asynchronously fetch the metadata of shards from all nodes. This information is then used by PSA/RSA to make the allocation decision. Currently `GatewayAllocator` maintains two such maps, `asyncFetchStarted` and `asyncFetchStore`, to store the async data. So we make M (number of unassigned shards) * N (number of data nodes) transport calls to assign the unassigned shards.
This POC batches the async shard transport calls per node and maintains that view of shard metadata from all nodes in `GatewayAllocator`. The changes try to replace one of the current maps, `asyncFetchStarted`, whose data (`TransportNodesListGatewayStartedShards.NodeGatewayStartedShards`) is used primarily in PSA, with `shardsPerNode` (the naming can be improved) to maintain the same data as a list (`TransportNodesCollectGatewayStartedShard.ListOfNodeGatewayStartedShards`) per node. Please note the changes are done crudely (just to convey the idea), reusing code from the current classes; these are draft changes.
Flow diagrams:
Simpler view:
What's changing:
Current transport calls (diagram)
Optimized transport calls (diagram)
List of changes:
- `GatewayAllocator`: added a new map `shardsPerNode` to replace `asyncFetchStarted`. The map contains the current view of the fetched data of unassigned shards from all nodes; this data, after adaptation via `AsyncShardFetch`, is then sent to PSA.
- `beforeAllocation()`: added `collectShardsPerNode(allocation)` as a responsibility. `beforeAllocation()` is called by `AllocationService` before starting the allocation, which helps populate `shardsPerNode` so that when the allocation decision is taken in `allocateUnassigned`, the data is readily available to PSA.
- `collectShardsPerNode()`: task to build the view of shard metadata from all nodes.
- `TestAsyncShardFetch`: inner class analogous to `InternalAsyncFetch`.
- `TestInternalPrimaryShardAllocator`: inner class analogous to `InternalPrimaryShardAllocator`. The `fetchData()` here does not actually do the heavy lifting of fetching; rather, it adapts the current data in `shardsPerNode` to the `AsyncShardFetch.FetchResult` that PSA understands. This minimizes the scope of changes in the POC; we can later revamp PSA/RSA to make the allocation decision based on the batched data.
- Added a new transport action, `TransportNodesCollectGatewayStartedShard`, analogous to `TransportNodesListGatewayStartedShards`. This is added just for simplicity of the code; we can think of merging it into a single transport action.
- Added a new `AsyncShardsFetchPerNode` class to track the async fetch of shards (`shardsToCustomDataPathMap`) from all nodes. It is analogous to the `AsyncShardFetch` class, which does the same thing per shard. We can later remove this class or merge it into `AsyncShardFetch`, depending on approval of the POC.

Testing
Did a happy-path test manually by spinning up two data nodes and a single master with 16 shards. Restarted the data nodes, and the shards went from Unassigned to Assigned.
Also tested the number of transport calls made by adding a log line in `nodeOperation`, and verified that a single transport call per node is made (when the set of unassigned shards is not updated). So that's a reduction of transport calls from M*N to N.
Issues Resolved
#5098
This might be an extension of the above issue: it may not completely solve the memory issue, but it addresses the transport-call optimization part.
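As a footnote on the `fetchData()` adaptation mentioned in the list of changes, a rough sketch of the idea: instead of performing transport calls, the adapter slices the already-built per-node view into the per-shard shape PSA expects. All types here are simplified stand-ins for the real OpenSearch classes:

```java
import java.util.HashMap;
import java.util.Map;

public class FetchDataAdapter {
    // nodeId -> (shardId -> started-shard metadata), built once in beforeAllocation()
    static final Map<String, Map<String, String>> shardsPerNode = new HashMap<>();

    // Per-shard view PSA understands: nodeId -> metadata for one shard.
    // No transport calls happen here, only a reshaping of cached data.
    static Map<String, String> fetchData(String shardId) {
        Map<String, String> perShard = new HashMap<>();
        shardsPerNode.forEach((nodeId, shards) -> {
            String metadata = shards.get(shardId);
            if (metadata != null) perShard.put(nodeId, metadata);
        });
        return perShard;
    }

    public static void main(String[] args) {
        shardsPerNode.put("node-1", Map.of("shard-0", "primary-md", "shard-1", "replica-md"));
        shardsPerNode.put("node-2", Map.of("shard-0", "replica-md"));
        System.out.println(fetchData("shard-0")); // metadata from both nodes
    }
}
```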
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.