unable to finalize action: Missing digest: <hash>/<len> for ...jdeps #22854

Closed
rbeasley-avgo opened this issue Jun 21, 2024 · 8 comments
Labels
more data needed team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug untriaged

Comments

@rbeasley-avgo

Description of the bug:

Since upgrading to Bazel 7, we've encountered numerous sporadic build failures. Most are covered by other GitHub issues, but AFAICT nobody's filed one about .jdeps files.

I am going to experiment with --noexperimental_inmemory_jdeps_files.

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Unknown.

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

release 7.2.0-vmware

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

This is just Bazel 7.2.0 with a handful of patches for PRs that are either outstanding or have been rejected. None are related to scheduling, remote caching, etc.

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

Any other information, logs, or outputs that you want to share?

Our RBE implementation is Buildfarm.

  • I haven't ruled out the possibility that Buildfarm is misbehaving. However,
    • We didn't run into failures like this when still on 6.4.0.
    • I was hoping that the retry flags specified below would allow us to recover from transfer failures.
  • Only RBE workers are allowed to write to Buildfarm's CAS (--noremote_upload_local_results).
    • Is this flag inherently incompatible w/ experimental_inmemory_foo flags?
    • We're using dynamic execution. If the flags are incompatible, then is it possible that dynamic execution contributes to the nondeterministic nature of the failures?

We're using the following options:

# RBE-related flags
--remote_download_outputs=all
--internal_spawn_scheduler
--spawn_strategy=dynamic
--dynamic_local_strategy=worker,sandboxed,local
--remote_retries=5
--experimental_remote_cache_eviction_retries=5
--verbose_failures
--remote_cache=
--disk_cache=
--noremote_upload_local_results
--experimental_remote_cache_async
--experimental_remote_merkle_tree_cache
--remote_local_fallback
--remote_local_fallback_strategy=sandboxed
--experimental_remote_downloader_local_fallback
--remote_cache_compression

# Workaround for https://github.com/bazelbuild/bazel/issues/22387 .
build --noexperimental_inmemory_dotd_files

In failing builds w/ this syndrome, java.log contains backtraces resembling the following

com.google.devtools.build.lib.remote.common.BulkTransferException: Missing digest: HASH/LEN for LABEL.jdeps
        at com.google.devtools.build.lib.remote.util.Utils.lambda$mergeBulkTransfer$4(Utils.java:656)
        at com.google.common.util.concurrent.CombinedFuture$AsyncCallableInterruptibleTask.runInterruptibly(CombinedFuture.java:165)
        at com.google.common.util.concurrent.CombinedFuture$AsyncCallableInterruptibleTask.runInterruptibly(CombinedFuture.java:153)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
        at com.google.common.util.concurrent.CombinedFuture$CombinedFutureInterruptibleTask.execute(CombinedFuture.java:108)
        at com.google.common.util.concurrent.CombinedFuture.handleAllCompleted(CombinedFuture.java:65)
        at com.google.common.util.concurrent.AggregateFuture.processCompleted(AggregateFuture.java:301)
        at com.google.common.util.concurrent.AggregateFuture.decrementCountAndMaybeComplete(AggregateFuture.java:283)
        at com.google.common.util.concurrent.AggregateFuture.lambda$init$1(AggregateFuture.java:181)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807)
        at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
        at com.google.devtools.build.lib.remote.util.RxFutures$1.onError(RxFutures.java:221)
        at io.reactivex.rxjava3.internal.operators.completable.CompletableFromSingle$CompletableFromSingleObserver.onError(CompletableFromSingle.java:41)
        at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.tryOnError(SingleCreate.java:95)
        at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.onError(SingleCreate.java:81)
        at com.google.devtools.build.lib.remote.util.AsyncTaskCache$1.onError(AsyncTaskCache.java:339)
        at com.google.devtools.build.lib.remote.util.AsyncTaskCache$Execution.onError(AsyncTaskCache.java:205)
        at io.reactivex.rxjava3.internal.operators.completable.CompletableToSingle$ToSingle.onError(CompletableToSingle.java:73)
        at io.reactivex.rxjava3.internal.operators.completable.CompletableUsing$UsingObserver.onError(CompletableUsing.java:165)
        at io.reactivex.rxjava3.internal.operators.completable.CompletablePeek$CompletableObserverImplementation.onError(CompletablePeek.java:95)
        at io.reactivex.rxjava3.internal.operators.completable.CompletablePeek$CompletableObserverImplementation.onError(CompletablePeek.java:95)
        at io.reactivex.rxjava3.internal.operators.completable.CompletableCreate$Emitter.tryOnError(CompletableCreate.java:91)
        at io.reactivex.rxjava3.internal.operators.completable.CompletableCreate$Emitter.onError(CompletableCreate.java:77)
        at com.google.devtools.build.lib.remote.util.RxFutures$OnceCompletableOnSubscribe$1.onFailure(RxFutures.java:102)
        at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1119)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807)
        at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
        at com.google.devtools.build.lib.remote.RemoteCache$3.onFailure(RemoteCache.java:381)
        at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1119)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
        at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:850)
        at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:125)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807)
        at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:105)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
        at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:850)
        at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:216)
        at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:192)
        at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:144)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
        at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:850)
        at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:216)
        at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:192)
        at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:144)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807)
        at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:105)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807)
        at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
        at com.google.devtools.build.lib.remote.util.RxFutures$2.onError(RxFutures.java:259)
        at io.reactivex.rxjava3.internal.operators.single.SingleFlatMap$SingleFlatMapCallback$FlatMapSingleObserver.onError(SingleFlatMap.java:117)
        at io.reactivex.rxjava3.internal.operators.single.SingleUsing$UsingSingleObserver.onError(SingleUsing.java:180)
        at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.tryOnError(SingleCreate.java:95)
        at io.reactivex.rxjava3.internal.operators.single.SingleCreate$Emitter.onError(SingleCreate.java:81)
        at com.google.devtools.build.lib.remote.util.RxFutures$OnceSingleOnSubscribe$1.onFailure(RxFutures.java:172)
        at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1119)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807)
        at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:55)
        at com.google.devtools.build.lib.remote.GrpcCacheClient$1.onError(GrpcCacheClient.java:453)
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at com.google.devtools.build.lib.remote.NetworkTimeInterceptor$NetworkTimeCall$1.onClose(NetworkTimeInterceptor.java:81)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at com.google.devtools.build.lib.remote.logging.LoggingInterceptor$LoggingForwardingCall$1.onClose(LoggingInterceptor.java:157)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
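
For anyone triaging the same symptom, here is a minimal sketch of how the affected outputs could be pulled out of java.log. It assumes each failure is logged on one line in the `Missing digest: <hash>/<size> for <path>` shape shown in the backtrace above, with a hexadecimal hash. The function names are hypothetical and not part of Bazel.

```python
import re
from collections import Counter

# Assumed line shape, taken from the backtrace above:
#   ...BulkTransferException: Missing digest: <hash>/<size> for <path>
MISSING_DIGEST = re.compile(
    r"Missing digest: (?P<hash>[0-9a-f]+)/(?P<size>\d+) for (?P<path>\S+)"
)


def find_missing_digests(lines):
    """Return a (hash, size, path) tuple for each 'Missing digest' line."""
    hits = []
    for line in lines:
        m = MISSING_DIGEST.search(line)
        if m:
            hits.append((m.group("hash"), int(m.group("size")), m.group("path")))
    return hits


def count_by_extension(hits):
    """Count hits by the extension of the output path (e.g. '.jdeps', '.d')."""
    counts = Counter()
    for _hash, _size, path in hits:
        name = path.rsplit("/", 1)[-1]
        ext = "." + name.rsplit(".", 1)[-1] if "." in name else ""
        counts[ext] += 1
    return counts
```

Calling `find_missing_digests(open("java.log"))` and then `count_by_extension(...)` on the result would show whether the failures are confined to .jdeps outputs or also hit other file types.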
@meisterT
Member

meisterT commented Jul 2, 2024

If possible, please provide a repro since it is unclear to us how it could happen and a repro would help making progress.

@rbeasley-avgo
Author

If possible, please provide a repro since it is unclear to us how it could happen and a repro would help making progress.

Believe me, I'm trying. :) In the meantime, are there any other artifacts that could help w/ post-mortem debugging (e.g. --remote_grpc_log, java.log)? I'm happy to configure our builds to collect more information, sanitize it, and share here. If not, no worries; I'll do what I can to repeat the failure and home in on a repro case.

@tjgq
Contributor

tjgq commented Jul 3, 2024

I suspect this might be the same as #22387 because .d and .jdeps files are handled similarly by Bazel. Does setting --noexperimental_inmemory_jdeps_files make the issue go away?

The --experimental_remote_grpc_log would be useful. Feel free to sanitize file names, input arguments, etc, but please leave the digests intact (or rewrite them in a consistent manner) so they can be correlated across requests.
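
Rewriting digests "in a consistent manner" could look like the sketch below. It assumes the log has already been converted to plain text and that digests appear as 64-character lowercase hex strings (SHA-256). `make_sanitizer` and the salt are hypothetical names for illustration, not part of Bazel or tools_remote.

```python
import hashlib
import re

# Assumption: digests are SHA-256 hashes printed as 64 lowercase hex characters.
HEX_DIGEST = re.compile(r"\b[0-9a-f]{64}\b")


def make_sanitizer(salt):
    """Build a function that replaces each digest with a stable pseudonym."""
    mapping = {}

    def replace(match):
        original = match.group(0)
        if original not in mapping:
            # Salted re-hash: the same input always maps to the same output,
            # so requests can still be correlated, but the real hash is hidden.
            mapping[original] = hashlib.sha256(
                (salt + original).encode()
            ).hexdigest()
        return mapping[original]

    def sanitize(text):
        return HEX_DIGEST.sub(replace, text)

    return sanitize
```

Using one sanitizer instance (or the same salt) for every file in a shared bundle keeps the pseudonyms consistent across files. File names and arguments would still need separate scrubbing.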

@rbeasley-avgo
Author

I suspect this might be the same as #22387 because .d and .jdeps files are handled similarly by Bazel. Does setting --noexperimental_inmemory_jdeps_files make the issue go away?

Yes, this goes away when using --noexperimental_inmemory_jdeps_files. We've had no such failures since adopting that flag.

@tjgq
Contributor

tjgq commented Jul 23, 2024

@rbeasley-avgo Can you provide either a repro, or an --experimental_remote_grpc_log for a build exhibiting this failure? Otherwise it's going to be difficult to make progress on this.

@rbeasley-avgo
Author

@rbeasley-avgo Can you provide either a repro, or an --experimental_remote_grpc_log for a build exhibiting this failure? Otherwise it's going to be difficult to make progress on this.

Apologies for the radio silence. Was on PTO.

I haven't been able to generate a repro, so instead I'm just waiting for the west coast to wake up to review a change that removes the --noexperimental_inmemory_foo flags from our builds. Once I have a few failures, I'll collect the gRPC logs, convert to plaintext (using bazelbuild/tools_remote), sanitize, and share with you.

@rbeasley-avgo
Author

@rbeasley-avgo Can you provide either a repro, or an --experimental_remote_grpc_log for a build exhibiting this failure? Otherwise it's going to be difficult to make progress on this.

Apologies for the radio silence. Was on PTO.

I haven't been able to generate a repro, so instead I'm just waiting for the west coast to wake up to review a change that removes the --noexperimental_inmemory_foo flags from our builds. Once I have a few failures, I'll collect the gRPC logs, convert to plaintext (using bazelbuild/tools_remote), sanitize, and share with you.

Just writing to let folks know that I haven't forgotten about this. I removed the --noexperimental_inmemory_{dotd,jdeps}_files options from our builds on 2024-07-23, but I still haven't encountered any related failure.

This may be a red herring, but I'll share anyway in case anyone else observes a similar correlation. These failures happened to coincide with a degraded internal RBE deployment, where we also observed remotely executing actions hanging indefinitely. I'm not on the RBE team, so I'm doing a lot of handwaving and uncritical repeating. Our RBE service is backed by Bazel Buildfarm. As put by the RBE team,

RBE builds are hanging in <our build harness>.
    - Dispatched operations are long lived
        - RBE managers are crashing
           - There is a race condition on an internal data structure that stores currently registered workers
               - The race condition is occurring now because servers are seeing a different number of workers
                  - Redis is giving different results when requesting currently registered workers
                     - Redis keys have diverged
                       - The following nodes are unhealthy <master> <--- <replica>
                         - ...

Our RBE team resolved this by redeploying the Redis cluster from scratch.

Other tweaks I've had to make (to avoid or improve diagnostics involving other Bazel crashes) are

I'm sorry that I don't have anything more useful to share. :(

@coeuvre
Member

coeuvre commented Oct 1, 2024

Since you were using Redis as an HTTP cache and removing --noexperimental_inmemory_{dotd,jdeps}_files doesn't reintroduce the error, it's probably a special case of #18696, which is fixed in Bazel 7.4+.

@coeuvre coeuvre closed this as completed Oct 1, 2024