[Enhancement] only call getAliveComputeNodes once per OlapScanNode (backport #52168) #52267

mergify · 2024-10-24T04:35:25Z

Why I'm doing:

I found some slow queries (on the order of 3-5 seconds) that were bottlenecked in the frontend. Their query profiles indicated that much of the query execution time was spent planning. I took jstack dumps of the frontends while sending this type of query; see jstack_example9.txt for an example. The takeaway is that the large majority of threads were busy in WarehouseManager.getAliveComputeNodes, called from OlapScanNode.addScanRangeLocations, just to check whether any compute nodes (CNs) are alive. This check runs once per PhysicalPartition, even though it is not parameterized by anything other than the warehouse id. That is wasteful, and seriously slow when partition/tablet counts are large. We can eliminate this bottleneck.

What I'm doing:

Ensuring that getAliveComputeNodes is called once per instance of OlapScanNode (once per query).
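As a rough illustration of the idea (not the actual patch), the sketch below caches the result of the alive-CN check on the scan node so that the per-partition path reuses it. `FakeWarehouseManager`, `ScanNodeSketch`, and their members are hypothetical stand-ins invented for this example; only the names `getAliveComputeNodes` and `addScanRangeLocations` come from the description above, and their real signatures may differ.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the warehouse manager. It counts lookups so the
// effect of caching is observable; the real lookup is assumed to be costly.
class FakeWarehouseManager {
    final AtomicInteger lookups = new AtomicInteger();
    private final List<Long> aliveNodeIds;

    FakeWarehouseManager(List<Long> aliveNodeIds) {
        this.aliveNodeIds = aliveNodeIds;
    }

    List<Long> getAliveComputeNodes(long warehouseId) {
        lookups.incrementAndGet();
        return aliveNodeIds;
    }
}

// Hypothetical, heavily simplified scan node: one instance per table scan.
class ScanNodeSketch {
    private final FakeWarehouseManager manager;
    private final long warehouseId;
    private Boolean aliveCheck; // null until the first lookup
    private int scheduledPartitions;

    ScanNodeSketch(FakeWarehouseManager manager, long warehouseId) {
        this.manager = manager;
        this.warehouseId = warehouseId;
    }

    // The check depends only on the warehouse id, so the answer is computed
    // on first use and then reused for every later partition.
    private boolean hasAliveComputeNode() {
        if (aliveCheck == null) {
            aliveCheck = !manager.getAliveComputeNodes(warehouseId).isEmpty();
        }
        return aliveCheck;
    }

    // Called once per partition during planning.
    void addScanRangeLocations(long partitionId) {
        if (!hasAliveComputeNode()) {
            throw new IllegalStateException(
                    "no alive compute node in warehouse " + warehouseId);
        }
        scheduledPartitions++;
    }

    int scheduledPartitions() {
        return scheduledPartitions;
    }
}
```

With this shape, planning a scan over 1000 partitions consults the manager once instead of 1000 times. The sketch assumes the set of alive nodes does not need to be re-read while a single scan node is being planned.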

This can be seen as a follow-up to #46913.

Fixes #issue

What type of PR is this:

This is a fix for a performance issue, which I'll call an enhancement.

Does this PR entail a change in behavior?

Yes, this PR will result in a change in behavior.
No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

Interface/UI changes: syntax, type conversion, expression evaluation, display information
Parameter changes: default values, similar parameters but with different default values
Policy changes: use new policy to replace old one, functionality automatically enabled
Feature removed
Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

I have added test cases for my bug fix or my new feature
This pr needs user documentation (for new or modified features or behaviors)
- I have added documentation for my new feature or new function
This is a backport pr

Bugfix cherry-pick branch check:

This is an automatic backport of pull request #52168 done by [Mergify](https://mergify.com).

sonarcloud · 2024-10-24T04:45:24Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

[Enhancement] only call getAliveComputeNodes once per OlapScanNode (#…

7600dbe

…52168) Signed-off-by: Connor Brennan <cbrennan@pinterest.com> (cherry picked from commit f581449)

mergify bot mentioned this pull request Oct 24, 2024

[Enhancement] only call getAliveComputeNodes once per OlapScanNode #52168

Merged

24 tasks

github-actions bot added the automerge label Oct 24, 2024

wanpengfei-git enabled auto-merge (squash) October 24, 2024 04:36

wyb approved these changes Oct 24, 2024

wanpengfei-git merged commit 6308edb into branch-3.3 Oct 24, 2024
34 of 35 checks passed

wanpengfei-git deleted the mergify/bp/branch-3.3/pr-52168 branch October 24, 2024 06:19

github-actions bot added the version:3.3.6 label Oct 24, 2024
