SOLR-16879: add queue for expensive admin operations #1761

psalagnac · 2023-07-06T22:13:47Z

https://issues.apache.org/jira/browse/SOLR-16879

Description

Solution

Add queue for expensive admin operations

Add a flag to core admin operations so they can be marked as expensive. For now, only backup and restore are expensive; this list may be extended.
In CoreAdminHandler, we count the number of in-flight expensive operations. If the limit (currently 5 by default) is already reached, we don't submit new ones to the thread pool but add them to a queue.
Each time an expensive operation completes, it starts the next one from the queue, if any.
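
The mechanism described above can be sketched in a few lines of plain Java. This is only an illustration under assumptions, not the code from this PR: the class name ExpensiveTaskLimiter and its methods are hypothetical, and the real change lives inside CoreAdminHandler.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executor;

/** Hypothetical helper for illustration; not a class from this PR or from Solr. */
class ExpensiveTaskLimiter {
  private final Executor executor;
  private final int maxInFlight;
  private final Queue<Runnable> pending = new ArrayDeque<>();
  private int inFlight = 0;

  ExpensiveTaskLimiter(Executor executor, int maxInFlight) {
    this.executor = executor;
    this.maxInFlight = maxInFlight;
  }

  /** Hands the task to the executor if a slot is free, otherwise queues it. */
  void submit(Runnable task) {
    synchronized (this) {
      if (inFlight >= maxInFlight) {
        pending.add(task);
        return;
      }
      inFlight++;
    }
    executor.execute(wrap(task));
  }

  /** Wraps a task so that finishing it (normally or not) releases its slot. */
  private Runnable wrap(Runnable task) {
    return () -> {
      try {
        task.run();
      } finally {
        onTaskFinished();
      }
    };
  }

  /** Passes the freed slot to the next queued task, or releases it if none. */
  private void onTaskFinished() {
    Runnable next;
    synchronized (this) {
      next = pending.poll();
      if (next == null) {
        inFlight--;
        return;
      }
      // The slot moves directly to the next task, so inFlight is unchanged.
    }
    executor.execute(wrap(next));
  }

  synchronized int pendingCount() {
    return pending.size();
  }

  synchronized int inFlightCount() {
    return inFlight;
  }
}
```

With a limit of 2 and five submitted tasks, two are handed to the executor right away and three wait in the queue until a running task finishes.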

Tests

Added in CoreAdminHandlerTest

Checklist

Please review the following and check all that apply:

I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
I have created a Jira issue and added the issue ID to my pull request title.
I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
I have developed this patch against the main branch.
I have run ./gradlew check.
I have added tests for my changes.
I have added documentation for the Reference Guide

solr/core/src/java/org/apache/solr/handler/admin/CoreAdminHandler.java

dsmiley

Just some little comments but overall looks good

dsmiley · 2023-08-03T03:35:17Z

Once solr/CHANGES.txt is updated, I think it's ready to merge

psalagnac · 2023-08-03T23:46:41Z

While testing this internally, I found a minor issue (will push a fix soon): the flag that marks an operation as expensive is not passed correctly.
I'd rather we wait a bit longer before merging.

psalagnac · 2023-08-07T09:24:51Z

Hi @dsmiley,
I think this change is now ready to be merged.
Thanks

dsmiley

Any thoughts on how this bug slipped by our notice?

dsmiley

Just an idea I want to write down...

One way to look at this problem is that maybe "async" should only be used for potentially expensive things. If we agreed on this, then the essential change would only be to make CoreAdminHandler.parallelExecutor's thread pool configurable. Secondarily, anywhere within Solr that a core admin request is made, we would ask ourselves: is this cheap or expensive? Today, the SolrCloud commands (which in turn invoke these core commands) often use async or not based on their caller's choice (i.e. the user's), but (A) this is harder, sometimes requiring dual code paths, and (B) I think internally the commands know what's potentially expensive, and need not internally use async just because async is used at a high level. In other words, just because a user might choose to create a collection in the async style doesn't mean we are compelled to internally use async for the core admin operations to complete the task.

dsmiley · 2023-08-07T17:13:07Z

solr/core/src/java/org/apache/solr/cloud/api/collections/CollectionHandlingUtils.java

@@ -506,30 +507,23 @@ private static NamedList<Object> waitForCoreAdminAsyncCallToComplete(
          }

          String r = (String) srsp.getSolrResponse().getResponse().get("STATUS");
-          if (r.equals("running")) {


Why don't we change this to a switch statement? Even IntelliJ is recommending this :-) It seems an underlying bug here would have been averted had a switch statement been used as it sort of forces you to consider all the enum values.

We're not checking an enum here, just string values.
I often feel that a switch statement on strings is error-prone and doesn't protect against regressions.
What do you think?

I'm suggesting the switch cases should be on the enum instances, not string literals.

I may be missing something. Are you suggesting we introduce a new enum for this? I don't think there is one for request statuses.
Here PENDING is just a constant string.

org.apache.solr.client.solrj.response.RequestStatusState but it looks like I confused collection async status with core async status. Too bad they don't use the same enum!
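
To make the suggestion in this thread concrete, here is a minimal sketch of switching on enum instances instead of comparing status strings. It rests on assumptions: CoreAsyncStatus and AsyncStatusPoller are hypothetical names, not existing Solr types, and the COMPLETED and FAILED values are assumed for illustration (only "running" and PENDING appear in the discussion above). A classic Java switch does not enforce exhaustiveness by itself, so the default branch fails loudly if a newly added value is left unhandled.

```java
import java.util.Locale;

/** Hypothetical status enum for illustration; not an existing Solr type. */
enum CoreAsyncStatus {
  PENDING,
  RUNNING,
  COMPLETED,
  FAILED;

  /** Parses a raw status string such as "running"; throws on unknown values. */
  static CoreAsyncStatus parse(String raw) {
    return valueOf(raw.trim().toUpperCase(Locale.ROOT));
  }
}

/** Hypothetical helper showing a switch over the enum instances. */
class AsyncStatusPoller {
  /** Returns true when the caller should poll the core again. */
  static boolean shouldKeepWaiting(CoreAsyncStatus status) {
    switch (status) {
      case PENDING: // queued, not yet handed to the executor
      case RUNNING: // handed to the executor
        return true;
      case COMPLETED:
      case FAILED:
        return false;
      default:
        // Reached only if a new enum value is added without updating this switch.
        throw new IllegalStateException("Unhandled status: " + status);
    }
  }
}
```

With this shape, a caller that waits on "running" also has an obvious place to handle "pending", and an unknown status string is rejected at parse time rather than silently treated as finished.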

stillalex · 2023-08-08T20:52:14Z

read through this, I'm not really familiar with any of this code but the change looks good to me.
minor question. why not add another executor (fixed number of threads similar to parallelExecutor) for these expensive operations? all the code that ensures the max number of running threads would no longer be necessary, the finishTask method would also be a bit simpler.

another one for my personal understanding: when the status of a task is reported as RUNNING it could either be queued or running. this means 'sent to executor', but there is no guarantee this will run immediately. introducing another status as PENDING might cause some confusion as to what is actually running at a given moment.

psalagnac · 2023-08-09T13:43:09Z

Any thoughts on how this bug slipped by our notice?

I was unable to write a unit test that actually performs an expensive operation (backup or restore). The low-level queue mechanism is well tested, but I'm not sure how to write an end-to-end test with expensive operations at scale.

read through this, I'm not really familiar with any of this code but the change looks good to me. minor question. why not add another executor (fixed number of threads similar to parallelExecutor) for these expensive operations? all the code that ensures the max number of running threads would no longer be necessary, the finishTask method would also be a bit simpler.

Yes, that would be another option.
This is at the cost of having many dormant threads most of the time. Also, manually handling a queue for expensive tasks allows for possible future improvements.
I don't have any strong opinion.

another one for my personal understanding: when the status of a task is reported as RUNNING it could either be queued or running. this means 'sent to executor', but there is no guarantee this will run immediately. introducing another status as PENDING might cause some confusion as to what is actually running at a given moment.

You're correct, RUNNING status means 'sent to executor'.
For most tasks, I don't expect the time spent by the task in the executor queue to be significant (otherwise, we would have to mark something else as expensive if it keeps the executor busy for a while).

I added the PENDING status to make this distinction visible. Expensive tasks can sit in the queue for seconds or minutes for big collections, so I think there is some value in the caller knowing that at some point.
(Note that the only caller is the overseer. The top-level task with a user-specified asyncId does not expose this pending status.)

dsmiley · 2023-08-09T18:52:33Z

I like Alex's idea of a second thread pool / executor; I think it would mean we would not need the explicit expensiveTaskQueue that is fiddly to work with and could rely on the more familiar Executor (that has an impl using a queue but it's not our code to worry about). But I don't have a strong opinion.

This is at the cost of having many dormant threads most of the time.

org.apache.solr.common.util.ExecutorUtil#newMDCAwareCachedThreadPool(int, java.util.concurrent.ThreadFactory) will not keep unused threads around long; only a minute.
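
The second-executor idea can be approximated with the plain JDK. The sketch below is an assumption-laden illustration, not the implementation of Solr's ExecutorUtil and not the code that was eventually merged: the class and method names are hypothetical, and the 60-second idle timeout simply mirrors the "only a minute" remark above. The pool caps concurrency at maxThreads, queues the overflow internally, and lets idle threads exit.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical factory for illustration; not part of Solr. */
class ExpensiveOpsExecutors {
  /**
   * Builds a pool that runs at most maxThreads tasks at once. Extra tasks wait
   * in the executor's own unbounded queue, so no hand-written queue is needed.
   */
  static ThreadPoolExecutor newExpensiveOpsPool(int maxThreads) {
    ThreadPoolExecutor pool =
        new ThreadPoolExecutor(
            maxThreads, // core size equals max size, so the queue absorbs overflow
            maxThreads,
            60L,
            TimeUnit.SECONDS,
            new LinkedBlockingQueue<>());
    // Let even core threads exit after 60 idle seconds to avoid dormant threads.
    pool.allowCoreThreadTimeOut(true);
    return pool;
  }
}
```

The trade-off compared with an explicit queue is that tasks waiting inside the executor are not visible to the handler, so reporting a separate pending status would need extra bookkeeping.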

dsmiley · 2023-08-31T18:30:16Z

Closed in favor of #1864

Pierre Salagnac added 3 commits July 6, 2023 23:03

Move action name to TaskObject

9d1a80c

Move callable in task object

59c211f

Queue for expensive admin tasks

5257bb0

sonatype-lift bot reviewed Jul 6, 2023

solr/core/src/java/org/apache/solr/handler/admin/CoreAdminHandler.java

dsmiley approved these changes Jul 7, 2023

psalagnac changed the title ~~Admin queue~~ SOLR-16879: add queue for expensive admin operations Jul 7, 2023

Pierre Salagnac added 5 commits July 7, 2023 17:56

Removed useless synchronized list

ee891f8

Use var keywork to reduce line length

86b3d54

Add comment

420c77c

Update changelog

c808703

Merge branch 'main' into admin-queue

602268e

psalagnac marked this pull request as ready for review July 7, 2023 16:32

bruno-roustant reviewed Jul 10, 2023

solr/core/src/java/org/apache/solr/handler/admin/CoreAdminHandler.java (outdated)

Pierre Salagnac added 2 commits July 12, 2023 21:10

Replace if by while, in case of spurious wakeup.

2306d0c

Replace queue type (build warning)

205109c

Pierre Salagnac added 2 commits August 7, 2023 11:21

Propagate expensive flag of admin operations

be4945f

Merge branch 'main' into admin-queue

a56b08c

Overseer waits for 'pending' core requests

be15c37

dsmiley reviewed Aug 7, 2023

psalagnac mentioned this pull request Aug 24, 2023

SOLR-16879: add dedicated thread pool for expensive admin operations #1864

Merged

dsmiley closed this Aug 31, 2023
