Bazel CI Flaky Test: //src/test/shell/bazel:starlark_repository_test (test_download_failure_message) #21238
Collected some jstack logs: https://gist.github.com/meteorcloudy/a7b812947d3e328daa5c8015a2d4ac2e
This is not reproducible if I add
Is the setup "multiple Bazel instances" or "multiple Bazel builds running against the same server"? If the latter, this could be caused by https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/bazel/repository/downloader/HttpDownloader.java;l=53;bpv=1;bpt=1?q=Httpdownloader&ss=bazel%2Fbazel.
So if we don't set
@justinhorvitz I may need some help here with the Skyframe internals. For context, this is a deadlock that shows up with the usage of the Loom+repo fetching stuff. This line in the jstack output looks particularly suspect: https://gist.github.com/meteorcloudy/a7b812947d3e328daa5c8015a2d4ac2e#file-jstack-bazel-log-L153 It looks like ParallelEvaluator.bubbleErrorUp is calling SkyFunction.compute in a somewhat special way. I tried digging around but got a bit lost. Justin, if this rings a bell, I'd appreciate some tips; otherwise I'll keep digging.
Guess: maybe related to Skyframe attempting to interrupt the computation during error bubbling: https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/skyframe/SkyFunctionEnvironment.java;l=597-600;drc=6cf19c71d086d2e83adef200e84830728eae1b21.
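The question above matters because state shared across builds in one server JVM can couple otherwise independent builds. Assuming the linked line declares a static (process-wide) limit on concurrent downloads, such as a `Semaphore`, here is a minimal sketch of the effect: permits held by one build's stalled downloads leave another build in the same JVM unable to start. The class name, the permit count, and the helper methods are hypothetical illustrations, not Bazel's code.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class SharedDownloadLimitDemo {
  // Hypothetical process-wide limit, shared by every build running in the same server JVM.
  static final int MAX_PARALLEL_DOWNLOADS = 2; // illustrative value only
  static final Semaphore DOWNLOAD_PERMITS = new Semaphore(MAX_PARALLEL_DOWNLOADS, true);

  /** Tries to start a download; returns false if no permit frees up within the timeout. */
  static boolean tryStartDownload(long timeoutMillis) throws InterruptedException {
    return DOWNLOAD_PERMITS.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
  }

  /** Marks a download as finished, returning its permit to the shared pool. */
  static void finishDownload() {
    DOWNLOAD_PERMITS.release();
  }

  public static void main(String[] args) throws InterruptedException {
    // "Build A" takes every permit, then its downloads stall without releasing.
    boolean a1 = tryStartDownload(100);
    boolean a2 = tryStartDownload(100);
    // "Build B" in the same JVM times out waiting for a permit.
    boolean bBlocked = !tryStartDownload(100);
    // Once A finishes one download, B can proceed.
    finishDownload();
    boolean bAfterRelease = tryStartDownload(100);
    System.out.println(a1 + " " + a2 + " " + bBlocked + " " + bAfterRelease);
  }
}
```

Under this assumption, separate Bazel instances (separate JVMs) would not contend for the same permits, which is why the distinction in the question matters.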
@Wyverald Should we mark this as a blocker for 7.1.0? It looks like this could cause hangs in production. |
I think so; at least the test flake also exists on release-7.1.0.
@bazel-io fork 7.1.0 |
There is a fix being submitted.
…ror bubbling For some reason, using worker threads for repo fetching during Skyframe error bubbling frequently causes deadlocks on Linux. I wasn't able to find out why the deadlock happens, but this CL is the immediate solution to the problem, and shouldn't be a performance concern since no Skyframe restarts should happen during error bubbling anyway. Tested on Linux; with this CL, `bazel test //src/test/shell/bazel:starlark_repository_test --test_filter=test_download_failure_message --runs_per_test=20` finishes just fine. (On an M1 MacBook, I can't trigger the deadlock even without this CL.) Fixes #21238 PiperOrigin-RevId: 606305306 Change-Id: I6f47a144b29030011f6c10c2b37f6874190fed0e
#21305) …ror bubbling: commit message identical to the one above (Fixes #21238, PiperOrigin-RevId: 606305306, Change-Id: I6f47a144b29030011f6c10c2b37f6874190fed0e).
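The commit message describes the fix as no longer using worker threads for repo fetching during Skyframe error bubbling, on the grounds that no Skyframe restarts should happen then anyway. Below is a minimal sketch of that decision, not Bazel's implementation: the class, the `fetch` helper, the `inErrorBubbling` flag, and the platform-thread executor (a stand-in for the Loom-based worker threads mentioned in the thread) are all hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ErrorBubblingFetchDemo {
  // Hypothetical stand-in for the repo-fetching worker threads.
  // Daemon threads so the JVM can exit without an explicit shutdown.
  private static final ExecutorService WORKERS =
      Executors.newCachedThreadPool(
          r -> {
            Thread t = new Thread(r, "repo-fetch-worker");
            t.setDaemon(true);
            return t;
          });

  /** Runs the fetch on the calling thread while error bubbling, otherwise on a worker thread. */
  static <T> T fetch(Callable<T> fetchLogic, boolean inErrorBubbling) throws Exception {
    if (inErrorBubbling) {
      // No restart is expected during error bubbling, so a worker thread buys nothing;
      // compute directly on the calling thread instead.
      return fetchLogic.call();
    }
    // Normal path: hand the work to a worker thread and wait for its result.
    Future<T> future = WORKERS.submit(fetchLogic);
    return future.get();
  }

  public static void main(String[] args) throws Exception {
    Callable<String> whichThread = () -> Thread.currentThread().getName();
    String normal = fetch(whichThread, false);
    String bubbling = fetch(whichThread, true);
    System.out.println(normal + " / " + bubbling);
  }
}
```

Running the fetch inline during error bubbling sidesteps the hand-off between threads entirely, which matches the commit's framing of the change as an immediate workaround rather than a root-cause fix.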
Description of the bug:
This test often times out in Bazel postsubmit: https://buildkite.com/bazel/bazel-bazel/builds/26661#018d837a-6b44-4938-be56-d6bf3c695381
Which category does this issue belong to?
No response
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
This can be easily reproduced within the Docker image `gcr.io/bazel-public/centos7-java11-devtoolset10`. Increasing the value of `--runs_per_test` increases the chance of reproducing this issue.
Which operating system are you running Bazel on?
Linux
What is the output of `bazel info release`?
7.0.2
If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.
No response
What's the output of `git remote get-url origin; git rev-parse HEAD`?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response