Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-16879: add queue for expensive admin operations #1761

Closed
wants to merge 13 commits into from

Conversation

psalagnac
Copy link
Contributor

@psalagnac psalagnac commented Jul 6, 2023

https://issues.apache.org/jira/browse/SOLR-16879

Description

Solution

Add queue for expensive admin operations

  • add a flag to core admin operations to be marked as expensive. For now, only backup and restore are expensive, this may be extended.
  • in CoreAdminHandler, we count the number of in-flight expensive operations. If more than the limit (currently 5 by default) are already in-flight, we don't submit any new ones to the thread pool, but we add them into a queue.
  • each time an expensive operation is completed, it starts the following one from the queue, if any.

Tests

Added in CoreAdminHandlerTest

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some little comments but overall looks good

@psalagnac psalagnac changed the title Admin queue SOLR-16879: add queue for expensive admin operations Jul 7, 2023
@psalagnac psalagnac marked this pull request as ready for review July 7, 2023 16:32
@dsmiley
Copy link
Contributor

dsmiley commented Aug 3, 2023

Once solr/CHANGES.txt is updated, I think it's ready to merge

@psalagnac
Copy link
Contributor Author

While testing this internally, I found a minor issue (will push a fix soon). The flag to set some operation as expensive is not correctly passed.
I'd rather we wait more before merging.

@psalagnac
Copy link
Contributor Author

Hi @dsmiley,
I think this change is now ready to be merged.
Thanks

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on how this bug slipped by our notice?

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an idea I want to write down...

One way to look at this problem, is that maybe "async" should only be used for potentially expensive things. If we agreed on this, then the essential change would only be to make CoreAdminHandler.parallelExecutor's thread pool configurable. Secondarily, any place within Solr a core admin request is made, we would ask ourselves, is this cheap or expensive? Today, the SolrCloud commands (which in turn invoke these core commands) often use async or not async based on its caller's choice (i.e. the user) but (A) this is harder -- sometimes dual code paths, and (B) I think internally the commands know what's potentially expensive, and need not internally use async just because async is used at a high level. In other words, just because a user might choose to create a collection in the async style doesn't mean we are compelled to internally use async for the core admin operations to complete the task.

@@ -506,30 +507,23 @@ private static NamedList<Object> waitForCoreAdminAsyncCallToComplete(
}

String r = (String) srsp.getSolrResponse().getResponse().get("STATUS");
if (r.equals("running")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we change this to a switch statement? Even IntelliJ is recommending this :-) It seems an underlying bug here would have been averted had a switch statement been used as it sort of forces you to consider all the enum values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't check an enum, but string values.
I often feel a switch statement on strings is error prone, and doesn't protect from regressions.
What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm suggesting the switch cases should be on the enum instances, not string literals.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may miss something. Do you suggest to introduce a new enum for this? I don't think there is one for request statuses.
Here PENDING is just a constant string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

org.apache.solr.client.solrj.response.RequestStatusState but it looks like I confused collection async status with core async status. Too bad they don't use the same enum!

@stillalex
Copy link
Member

read through this, I'm not really familiar with any of this code but the change looks good to me.
minor question. why not add another executor (fixed number of threads similar to parallelExecutor) for these expensive operations? all the code that ensures the max number of running threads would no longer be necessary, the finishTask method would also be a bit simpler.

another one for my personal understanding: when the status of a task is reported as RUNNING it could either be queued or running. this means 'sent to executor', but there is no guarantee this will run immediately. introducing another status as PENDING might cause some confusion to what is actually running at a given moment.

@psalagnac
Copy link
Contributor Author

Any thoughts on how this bug slipped by our notice?

I was unable to write a unit that actually do an expensive operation (backup or restore). The low-level queue mechanism is well tested, but I'm not sure how to write an end-to-end test with expensive operations at scale.

read through this, I'm not really familiar with any of this code but the change looks good to me. minor question. why not add another executor (fixed number of threads similar to parallelExecutor) for these expensive operations? all the code that ensures the max number of running threads would no longer be necessary, the finishTask method would also be a bit simpler.

Yes, that would be another option.
This is at the cost of having many dormant threads most of the time. Also, manually handling a queue for expensive tasks allow eventual future improvements.
I don't have any strong opinion.

another one for my personal understanding: when the status of a task is reported as RUNNING it could either be queued or running. this means 'sent to executor', but there is no guarantee this will run immediately. introducing another status as PENDING might cause some confusion to what is actually running at a given moment.

You're correct, RUNNING status means 'sent to executor'.
For most of the tasks, I don't expect the time spend by the task in the executor queue to be significant (otherwise, we would have to mark something else as expensive if it keeps the executor busy for a while).

I added the PENDING status to make this difference. Expensive tasks can be in the queue for seconds/minutes for big collections, so I think there is some value for the caller to know that at some point.
(note that the caller is the overseer only. The top level task with user specified asyncId does not expose this pending status)

@dsmiley
Copy link
Contributor

dsmiley commented Aug 9, 2023

I like Alex's idea of a second thread pool / executor; I think it would mean we would not need the explicit expensiveTaskQueue that is fiddly to work with and could rely on the more familiar Executor (that has an impl using a queue but it's not our code to worry about). But I don't have a strong opinion.

This is at the cost of having many dormant threads most of the time.

org.apache.solr.common.util.ExecutorUtil#newMDCAwareCachedThreadPool(int, java.util.concurrent.ThreadFactory) will not keep unused threads around long; only a minute.

@dsmiley
Copy link
Contributor

dsmiley commented Aug 31, 2023

Closed in favor of #1864

@dsmiley dsmiley closed this Aug 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants