SOLR-16871: Race condition in `CoordinatorHttpSolrCall` synthetic collection/replica init #1762

patsonluk · 2023-07-06T22:25:05Z

https://issues.apache.org/jira/browse/SOLR-16871

Description

A unit test that issues concurrent select queries to coordinator nodes revealed three race conditions:

1. If multiple concurrent requests find that the synthetic collection has not been created yet, they may all attempt to create it. This can trigger a SolrException because the collection already exists.
2. Similarly, if multiple concurrent requests find no replica of the synthetic collection on the current node (the multiple-coordinator-node scenario), CoordinatorHttpSolrCall#addReplica can be invoked multiple times. This should not trigger an exception, but it creates multiple replicas for the same node in the synthetic collection.
3. The existing logic assumes that if syntheticColl.getReplicas(solrCall.cores.getZkController().getNodeName()) returns a non-empty result, the subsequent call to solrCall.getCoreByCollection(syntheticCollectionName, isPreferLeader) returns a core. Unfortunately, the first call can return a non-empty list that contains only a DOWN replica if another request is in the process of creating that replica. In that case, solrCall.getCoreByCollection(syntheticCollectionName, isPreferLeader) calls super.getCoreByCollection, which returns null (the super implementation only returns active replicas). CoordinatorHttpSolrCall#getCoreByCollection then ends up calling CoordinatorHttpSolrCall#getCore again, which produces infinite recursion and a stack overflow.
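
To make the third issue easier to follow, here is a toy model of the call loop. It is only a sketch: `CallLoopSketch`, `ReplicaInfo`, and `MAX_DEPTH` are hypothetical names and none of this is Solr code. It assumes, as described above, that a replica on the node is taken as proof that a core can be loaded, while only ACTIVE replicas actually yield a core.

```java
import java.util.List;

/** Toy model of the call loop; none of these types are Solr classes. */
final class CallLoopSketch {
  enum State { ACTIVE, DOWN }

  /** Hypothetical stand-in for one replica entry in the cluster state. */
  static final class ReplicaInfo {
    final String nodeName;
    final State state;

    ReplicaInfo(String nodeName, State state) {
      this.nodeName = nodeName;
      this.state = state;
    }
  }

  // Guard added only so the demo fails fast instead of overflowing the stack.
  static final int MAX_DEPTH = 50;

  /** Models the assumption: any replica on this node means a core can be loaded. */
  static String getCore(List<ReplicaInfo> replicas, String node, int depth) {
    if (depth > MAX_DEPTH) {
      throw new IllegalStateException("call loop detected at depth " + depth);
    }
    boolean replicaOnNode = false;
    for (ReplicaInfo r : replicas) {
      if (r.nodeName.equals(node)) {
        replicaOnNode = true;
      }
    }
    if (!replicaOnNode) {
      return null; // the real code would add a replica here; omitted in this model
    }
    return getCoreByCollection(replicas, node, depth);
  }

  /** Only ACTIVE replicas yield a core; otherwise fall back to getCore, closing the loop. */
  static String getCoreByCollection(List<ReplicaInfo> replicas, String node, int depth) {
    for (ReplicaInfo r : replicas) {
      if (r.nodeName.equals(node) && r.state == State.ACTIVE) {
        return "core-on-" + node;
      }
    }
    return getCore(replicas, node, depth + 1);
  }

  public static void main(String[] args) {
    List<ReplicaInfo> active = List.of(new ReplicaInfo("node1", State.ACTIVE));
    System.out.println(getCore(active, "node1", 0));

    List<ReplicaInfo> down = List.of(new ReplicaInfo("node1", State.DOWN));
    try {
      getCore(down, "node1", 0);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

With an ACTIVE replica the lookup returns immediately; with a DOWN replica the two methods keep calling each other until the guard trips. The real code has no such guard, hence the stack overflow.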

Solution

- If collection creation throws an exception, check again whether the collection exists; if it does, ignore the exception and proceed.
- If a replica for this node is already present in the DocCollection, ensure that it is active using zkStateReader.waitForState. This avoids the infinite loop caused by the presence of a DOWN replica.
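
The first bullet can be sketched as a create-then-recheck pattern. This is an illustration under assumptions, not the code in this PR: `FakeCluster` is a hypothetical in-memory stand-in for the cluster state, and `IllegalStateException` stands in for the SolrException raised when the collection already exists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class CreateIfAbsentSketch {
  /** Hypothetical in-memory stand-in for the cluster's collection list. */
  static final class FakeCluster {
    private final Set<String> collections = ConcurrentHashMap.newKeySet();

    boolean exists(String name) {
      return collections.contains(name);
    }

    /** Throws if the collection already exists, like a create call that lost the race. */
    void create(String name) {
      if (!collections.add(name)) {
        throw new IllegalStateException("collection already exists: " + name);
      }
    }
  }

  /** Try to create; on failure, re-check and swallow the error only if the collection exists. */
  static void ensureCollection(FakeCluster cluster, String name) {
    if (cluster.exists(name)) {
      return;
    }
    try {
      cluster.create(name);
    } catch (IllegalStateException e) {
      if (!cluster.exists(name)) {
        throw e; // a genuine failure, not a lost race
      }
    }
  }

  /** Runs ensureCollection from several threads and returns how many calls failed. */
  static int runConcurrently(int threads) {
    FakeCluster cluster = new FakeCluster();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    List<Future<?>> futures = new ArrayList<>();
    for (int i = 0; i < threads; i++) {
      futures.add(pool.submit(() -> ensureCollection(cluster, ".sys.COORDINATOR-COLL-conf")));
    }
    int failures = 0;
    for (Future<?> f : futures) {
      try {
        f.get();
      } catch (InterruptedException | ExecutionException e) {
        failures++;
      }
    }
    pool.shutdown();
    return cluster.exists(".sys.COORDINATOR-COLL-conf") ? failures : -1;
  }

  public static void main(String[] args) {
    System.out.println("failed calls: " + runConcurrently(8));
  }
}
```

The design choice here is to tolerate the lost race rather than prevent it: no lock is taken, and the only extra cost is one existence check on the failure path.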

Note that this does NOT avoid the 2nd issue above: concurrent requests can still create multiple replicas for the same node in the synthetic collection, though that is probably benign (and unlikely).

Remarks: The first attempt was to provide proper locking to avoid the race condition. However, that is quite tricky to get right; it might require force-refreshing the zkReader and doing multiple extra reads. The extra cost and complexity probably do not justify the gain.

Tests

Added TestCoordinatorRole#testConcurrentAccess to reproduce the issue

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Reference Guide

sonatype-lift · 2023-07-06T23:12:07Z

solr/core/src/java/org/apache/solr/servlet/CoordinatorHttpSolrCall.java

+          //and then CoordinatorHttpSolrCall will call getCore again hence creating a calling loop
+          try {
+            zkStateReader.waitForState(syntheticCollectionName, 10, TimeUnit.SECONDS, docCollection -> {
+              for (Replica nodeNameSyntheticReplica : docCollection.getReplicas(solrCall.cores.getZkController().getNodeName())) {


NULLPTR_DEREFERENCE: List DocCollection.getReplicas(String) could be null (from the call to DocCollection.getReplicas(...) on line 139) and is dereferenced.

ℹ️ Expand to see all @sonatype-lift commands

You can reply with the following commands. For example, reply with @sonatype-lift ignoreall to leave out all findings.

Command Usage

@sonatype-lift ignore Leave out the above finding from this PR

@sonatype-lift ignoreall Leave out all the existing findings from this PR

@sonatype-lift exclude <file|issue|path|tool> Exclude specified file|issue|path|tool from Lift findings by updating your config.toml file

Note: When talking to LiftBot, you need to refresh the page to see its response.
_{Click here to add LiftBot to another repo.}

sonatype-lift · 2023-07-06T23:12:09Z

solr/core/src/java/org/apache/solr/servlet/CoordinatorHttpSolrCall.java

+            });
+          } catch (Exception e) {
+            throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Failed to wait for active replica for synthetic collection [" + syntheticCollectionName + "]", e);
+          }
        }
        core = solrCall.getCoreByCollection(syntheticCollectionName, isPreferLeader);


THREAD_SAFETY_VIOLATION: Unprotected write. Non-private method CoordinatorHttpSolrCall.getCore(...) indirectly mutates container core.SolrResourceLoader.classNameCache via call to Map.put(...) outside of synchronization.
Reporting because a superclass class org.apache.solr.servlet.HttpSolrCall is annotated @ThreadSafe, so we assume that this method can run in parallel with other non-private methods in the class (including itself).

sonatype-lift · 2023-07-06T23:34:16Z

solr/core/src/java/org/apache/solr/servlet/CoordinatorHttpSolrCall.java

+                TimeUnit.SECONDS,
+                docCollection -> {
+                  for (Replica nodeNameSyntheticReplica :
+                      docCollection.getReplicas(solrCall.cores.getZkController().getNodeName())) {


NULLPTR_DEREFERENCE: List DocCollection.getReplicas(String) could be null (from the call to DocCollection.getReplicas(...) on line 149) and is dereferenced.

noblepaul

LGTM

noblepaul · 2023-07-10T19:06:35Z

@patsonluk I've made the changes to avoid race condition in core creation as well

patsonluk · 2023-07-10T22:13:39Z

solr/core/src/java/org/apache/solr/servlet/CoordinatorHttpSolrCall.java

@@ -208,9 +208,12 @@ private static void setMDCLoggingContext(String collectionName) {
  private static void addReplica(String syntheticCollectionName, CoreContainer cores) {
    SolrQueryResponse rsp = new SolrQueryResponse();
    try {
+      String coreName = syntheticCollectionName + "_" + "r1";


Do we want to include the node name as part of the core name? Otherwise two coordinator nodes might use the same name for the core.

I thought about that. Maybe not required. Core names do not have to be unique

Good to know! 👍🏼

In this case we might want to always call waitForState as in https://github.com/apache/solr/pull/1762/files#diff-eedf409265fc219f98f193ae89d3f1b09df78fe49f70bb5b9eaa6c6ff46e6ac7R143 since addReplica can now return while the target replica is still under construction by another request.

Ah, I think we do need to add the node name, because the AddReplicaCmd logic iterates through all slices and throws an exception if any core name matches.

Therefore, the 1st coordinator node is fine: its replica is created by the collection creation call, which follows the standard naming, i.e. .sys.COORDINATOR-COLL-conf_shard1_replica_n1.

The 2nd coordinator node is also fine: it gets .sys.COORDINATOR-COLL-conf_r1.

However, starting from the 3rd coordinator node, it fails with an infinite loop. The node skips adding a replica (because of the core name check), but solrCall.getCoreByCollection(syntheticCollectionName, isPreferLeader) cannot load such a core (since it is not on the 3rd node), which causes the infinite loop.

This can be reproduced by changing the node count in the test case from 2 to 3.
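
For illustration, one way to derive a per-node core name is sketched below. The helper name `coreNameFor` is hypothetical; the scheme (collection name, underscore, node name with ':' replaced by '_') is inferred from the commit titles "Use node name in core name" and "replace ':' with '_'" and from the core names in the replica listing later in this thread, so treat it as an assumption rather than the exact merged code.

```java
final class SyntheticCoreNameSketch {
  /** Hypothetical helper: builds a core name that is unique per coordinator node. */
  static String coreNameFor(String syntheticCollectionName, String nodeName) {
    // ':' is replaced because node names look like "host:port_solr".
    return syntheticCollectionName + "_" + nodeName.replace(':', '_');
  }

  public static void main(String[] args) {
    System.out.println(coreNameFor(".sys.COORDINATOR-COLL-conf", "127.0.0.1:49656_solr"));
  }
}
```

Because every coordinator node has a distinct node name, the 3rd and later nodes no longer collide with the core name used by the 2nd.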

true. I'm not even sure why someone added that check

Thank you @noblepaul !

I added another small commit 45654e1 on TestCoordinatorRole to ensure that the fix works:

- Changed the coordinator node count from 2 to 4
- Verified the replica count on the synthetic collection

There is also a minor change to the addReplica flow: we now always check the replica status afterwards. addReplica might now return after an exception is thrown and caught, and if the replica is not yet active we could run into the infinite call loop. This is a rather rare case, but it doesn't hurt to check.
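
The "always check afterwards" step amounts to waiting, with a timeout, until the replica is active. The diff in this PR uses zkStateReader.waitForState with a collection name, a timeout, a TimeUnit, and a predicate; the sketch below is a simplified polling stand-in for that idea, not the Solr implementation. `waitFor` and `becomesActiveWithin` are hypothetical helpers, and an AtomicBoolean plays the role of the replica state.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

final class WaitForActiveSketch {
  /** Polls the condition until it holds or the timeout elapses. */
  static void waitFor(BooleanSupplier condition, long timeout, TimeUnit unit)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException("condition not met within " + timeout + " " + unit);
      }
      Thread.sleep(10); // simple polling interval for the demo
    }
  }

  /** Simulates another request activating the replica after a delay. */
  static boolean becomesActiveWithin(long activateAfterMs, long timeoutMs) {
    AtomicBoolean active = new AtomicBoolean(false);
    Thread creator = new Thread(() -> {
      try {
        Thread.sleep(activateAfterMs);
        active.set(true);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    creator.setDaemon(true);
    creator.start();
    try {
      waitFor(active::get, timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      return false;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("active in time: " + becomesActiveWithin(50, 2000));
    System.out.println("active in time: " + becomesActiveWithin(2000, 100));
  }
}
```

Only once the wait succeeds is it safe to look up the core; on timeout the caller can fail the request with an error instead of re-entering the lookup.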

tflobbe · 2023-07-18T22:36:00Z

@noblepaul @patsonluk, This test has been failing very frequently since merged, did you have time to take a look?

patsonluk · 2023-07-18T22:46:42Z

> @noblepaul @patsonluk, This test has been failing very frequently since merged, did you have time to take a look?

@tflobbe thanks for raising the concern, do you have any links to failures?

Update: ah, found some at http://fucit.org/solr-jenkins-reports/failure-report.html

org.apache.solr.search.TestCoordinatorRole > testConcurrentAccess FAILED
    java.lang.AssertionError: expected:<4> but was:<5>
        at __randomizedtesting.SeedInfo.seed([DD56518160526F0E:124B6E8001D7925D]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.apache.solr.search.TestCoordinatorRole.testConcurrentAccess(TestCoordinatorRole.java:573)

patsonluk · 2023-07-18T23:07:48Z

@noblepaul I think setting the core name does not work, as it does not prevent a duplicate core on the first coordinator node. I printed out the replica list on failure and it shows:

[core_node2:{
  "core":".sys.COORDINATOR-COLL-conf_shard1_replica_n1",
  "leader":"true",
  "node_name":"127.0.0.1:49656_solr",
  "base_url":"https://127.0.0.1:49656/solr",
  "state":"active",
  "type":"NRT",
  "force_set_state":"false"}, core_node3:{
  "node_name":"127.0.0.1:49656_solr",
  "base_url":"https://127.0.0.1:49656/solr",
  "core":".sys.COORDINATOR-COLL-conf_127.0.0.1_49656_solr",
  "state":"active",
  "type":"NRT",
  "force_set_state":"false"}, core_node4:{
  "node_name":"127.0.0.1:49647_solr",
  "base_url":"https://127.0.0.1:49647/solr",
  "core":".sys.COORDINATOR-COLL-conf_127.0.0.1_49647_solr",
  "state":"active",
  "type":"NRT",
  "force_set_state":"false"}, core_node5:{
  "node_name":"127.0.0.1:49650_solr",
  "base_url":"https://127.0.0.1:49650/solr",
  "core":".sys.COORDINATOR-COLL-conf_127.0.0.1_49650_solr",
  "state":"active",
  "type":"NRT",
  "force_set_state":"false"}, core_node6:{
  "node_name":"127.0.0.1:49653_solr",
  "base_url":"https://127.0.0.1:49653/solr",
  "core":".sys.COORDINATOR-COLL-conf_127.0.0.1_49653_solr",
  "state":"active",
  "type":"NRT",
  "force_set_state":"false"}]

Any thoughts? 🤔
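
To make the duplicate explicit, the listing above can be grouped by node. This is a small sketch with hypothetical names (`ReplicaCountSketch`, `listingFromThread`, `countByNode`); the core and node names are copied from the listing.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class ReplicaCountSketch {
  /** Core name to node name, copied from the replica listing in this comment. */
  static Map<String, String> listingFromThread() {
    Map<String, String> coreToNode = new LinkedHashMap<>();
    coreToNode.put(".sys.COORDINATOR-COLL-conf_shard1_replica_n1", "127.0.0.1:49656_solr");
    coreToNode.put(".sys.COORDINATOR-COLL-conf_127.0.0.1_49656_solr", "127.0.0.1:49656_solr");
    coreToNode.put(".sys.COORDINATOR-COLL-conf_127.0.0.1_49647_solr", "127.0.0.1:49647_solr");
    coreToNode.put(".sys.COORDINATOR-COLL-conf_127.0.0.1_49650_solr", "127.0.0.1:49650_solr");
    coreToNode.put(".sys.COORDINATOR-COLL-conf_127.0.0.1_49653_solr", "127.0.0.1:49653_solr");
    return coreToNode;
  }

  /** Counts how many replicas each node hosts. */
  static Map<String, Integer> countByNode(Map<String, String> coreToNode) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String node : coreToNode.values()) {
      counts.merge(node, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    countByNode(listingFromThread())
        .forEach((node, count) -> System.out.println(node + " -> " + count));
  }
}
```

Five replicas are spread over four nodes: the node at port 49656 hosts both the replica created with the collection (shard1_replica_n1) and a second one created through addReplica, which matches the `expected:<4> but was:<5>` assertion failure.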

patsonluk · 2023-07-18T23:37:09Z

@noblepaul what do you think about this proposed fix? 😊 #1794

patsonluk added 3 commits July 6, 2023 15:19

Fixed race conditions for CoordinatorHttpSolrCall

19433cf

./gradlew tidy

5206d7c

Use ExecutorUtil and SolrNamedThreadFactory

b9b0f5f

sonatype-lift bot reviewed Jul 6, 2023

noblepaul approved these changes Jul 7, 2023

patsonluk marked this pull request as ready for review July 7, 2023 19:25

Handle race condition in core creation

61dc66e

noblepaul requested a review from justinrsweeney July 10, 2023 19:06

patsonluk commented Jul 10, 2023

noblepaul and others added 4 commits July 12, 2023 22:49

Use node name in core name

7231b82

replace ':' with '_'

e8b7151

tidy

3bb2e8c

Improved test case to verify synthetic replica count with more coordi…

45654e1

…nator node Minor fix to synthetic collection addReplica flow to ensure no stack overflow

sonatype-lift bot reviewed Jul 12, 2023

noblepaul merged commit fa024e8 into apache:main Jul 13, 2023
2 checks passed

noblepaul pushed a commit that referenced this pull request Jul 13, 2023

SOLR-16871: Race condition in CoordinatorHttpSolrCall synthetic col…

1b74bc9

…lection/replica init (#1762)

patsonluk mentioned this pull request Jul 18, 2023

SOLR-16871: Fix for duplicated replica added from first coordinator node #1794

Merged

7 tasks

patsonluk mentioned this pull request Jul 21, 2023

SOLR-16871: Synchronize on a larger block to avoid race condition in CoordinatorHttpSolrCall init #1800

Merged

7 tasks

patsonluk added a commit to cowpaths/fullstory-solr that referenced this pull request Aug 3, 2023

SOLR-16871: Race condition in CoordinatorHttpSolrCall synthetic col…

f0d966c

…lection/replica init (apache#1762)
