SOLR-16871: Fix for duplicated replica added from first coordinator node #1794

@patsonluk patsonluk commented Jul 18, 2023

https://issues.apache.org/jira/browse/SOLR-16871

Description

PR #1762 fixes various race conditions for coordinator nodes. One of the fixes restricts the name of the synthetic core to ensure that at most 1 core is created per coordinator node.

Unfortunately, unit tests still failed because 2 cores could be added for the first coordinator node: the 1st from collection creation, which uses the default naming scheme, and the 2nd from the addReplica call, which uses a different name.

For example, this is the list of replicas from a failed run of TestCoordinatorRole#testConcurrentAccess, which is supposed to create only 4 cores, one for each of the 4 coordinator nodes:

core_node2:{
  "core":".sys.COORDINATOR-COLL-conf_shard1_replica_n1",
  "leader":"true",
  "node_name":"127.0.0.1:49656_solr",
  "base_url":"https://127.0.0.1:49656/solr",
  "state":"active",
  "type":"NRT",
  "force_set_state":"false"}, 
core_node3:{
  "node_name":"127.0.0.1:49656_solr",
  "base_url":"https://127.0.0.1:49656/solr",
  "core":".sys.COORDINATOR-COLL-conf_127.0.0.1_49656_solr",
  "state":"active",
  "type":"NRT",
  "force_set_state":"false"}, 
core_node4:{
  "node_name":"127.0.0.1:49647_solr",
  "base_url":"https://127.0.0.1:49647/solr",
  "core":".sys.COORDINATOR-COLL-conf_127.0.0.1_49647_solr",
  "state":"active",
  "type":"NRT",
  "force_set_state":"false"}, 
core_node5:{
  "node_name":"127.0.0.1:49650_solr",
  "base_url":"https://127.0.0.1:49650/solr",
  "core":".sys.COORDINATOR-COLL-conf_127.0.0.1_49650_solr",
  "state":"active",
  "type":"NRT",
  "force_set_state":"false"}, 
core_node6:{
  "node_name":"127.0.0.1:49653_solr",
  "base_url":"https://127.0.0.1:49653/solr",
  "core":".sys.COORDINATOR-COLL-conf_127.0.0.1_49653_solr",
  "state":"active",
  "type":"NRT",
  "force_set_state":"false"}]

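To make the duplicate in the listing above easier to spot, here is a minimal sketch that counts replicas per node_name and reports any node hosting more than one. The class and method names (DuplicateReplicaCheck, nodesWithDuplicates, failedRun) are hypothetical and not part of Solr; the core and node names are copied from the listing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Solr.
final class DuplicateReplicaCheck {

  /** Returns node_name -> replica count for every node that hosts more than one replica. */
  static Map<String, Integer> nodesWithDuplicates(Map<String, String> coreToNode) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String node : coreToNode.values()) {
      counts.merge(node, 1, Integer::sum); // count replicas per node
    }
    counts.values().removeIf(count -> count < 2); // keep only nodes with duplicates
    return counts;
  }

  /** Core name -> node_name pairs taken from the failed run shown above. */
  static Map<String, String> failedRun() {
    Map<String, String> coreToNode = new LinkedHashMap<>();
    coreToNode.put(".sys.COORDINATOR-COLL-conf_shard1_replica_n1", "127.0.0.1:49656_solr");
    coreToNode.put(".sys.COORDINATOR-COLL-conf_127.0.0.1_49656_solr", "127.0.0.1:49656_solr");
    coreToNode.put(".sys.COORDINATOR-COLL-conf_127.0.0.1_49647_solr", "127.0.0.1:49647_solr");
    coreToNode.put(".sys.COORDINATOR-COLL-conf_127.0.0.1_49650_solr", "127.0.0.1:49650_solr");
    coreToNode.put(".sys.COORDINATOR-COLL-conf_127.0.0.1_49653_solr", "127.0.0.1:49653_solr");
    return coreToNode;
  }
}
```

Calling nodesWithDuplicates(failedRun()) reports only 127.0.0.1:49656_solr, with a count of 2: one core with the default name (shard1_replica_n1) and one with the node-based name.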
Solution

Instead of restricting the core name, which is hard to get right, perhaps we can synchronize the replica creation block. This block should rarely be called: only once per collection, on the first query after node start. Replica creation is even less frequent: it happens only the very first time a coordinator node encounters a new config. So I think it's probably better to simply synchronize the block.
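As a rough illustration of the idea (not the actual patch), the sketch below puts the "does this node already have a replica?" check and the replica creation under one lock, so the two steps cannot interleave across threads. Everything here is a hypothetical stand-in: SyntheticReplicaSketch, ensureReplica and runConcurrently are made-up names, and the in-memory list stands in for the collection state and the addReplica call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the coordinator's replica bookkeeping; not Solr code.
final class SyntheticReplicaSketch {
  // One entry per created replica, recorded by node name (assumption for this sketch).
  private final List<String> replicaNodes = new ArrayList<>();
  private final Object lock = new Object();

  /** Check-then-create under a single lock; returns true only if a replica was created. */
  boolean ensureReplica(String nodeName) {
    synchronized (lock) {
      if (replicaNodes.contains(nodeName)) {
        return false; // the node already has a replica, whatever its core name is
      }
      replicaNodes.add(nodeName); // stands in for the addReplica call
      return true;
    }
  }

  /** Starts the given number of threads at once and returns how many replicas were created. */
  static int runConcurrently(int threads, String nodeName) {
    SyntheticReplicaSketch registry = new SyntheticReplicaSketch();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch start = new CountDownLatch(1);
    AtomicInteger created = new AtomicInteger();
    for (int i = 0; i < threads; i++) {
      pool.execute(() -> {
        try {
          start.await(); // release all threads together to provoke contention
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        if (registry.ensureReplica(nodeName)) {
          created.incrementAndGet();
        }
      });
    }
    start.countDown();
    pool.shutdown();
    try {
      if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
        throw new IllegalStateException("worker threads did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return created.get();
  }
}
```

With the lock in place, runConcurrently(8, "127.0.0.1:49656_solr") creates exactly one replica no matter how the threads are scheduled. Since the block is entered so rarely, the cost of the lock should be negligible.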

Tests

Re-ran the test case 10 times and ensured that all runs passed:
./gradlew :solr:core:beast -Ptests.dups=10 --tests "org.apache.solr.search.TestCoordinatorRole.testConcurrentAccess"

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@noblepaul noblepaul merged commit 9cdf0e4 into apache:main Jul 19, 2023
2 checks passed
patsonluk added a commit to cowpaths/fullstory-solr that referenced this pull request Aug 1, 2023
patsonluk added a commit to cowpaths/fullstory-solr that referenced this pull request Aug 3, 2023
justinrsweeney pushed a commit to cowpaths/fullstory-solr that referenced this pull request Apr 26, 2024