
Do not tolerate exceptions while storing the serialized network #946

Open
wants to merge 1 commit into base: dev

Conversation

trevorgerhardt (Member)

fileStorage.moveIntoStorage() failed recently, and the reason for the failure was hidden until we logged into the server and checked the logs.

This PR removes the try/catch block so that exceptions here bubble up to the user during network building and saving.

Commit: Remove try/catch block that did not allow exceptions to bubble up to the user during network building and saving.
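
Roughly, the change amounts to the following (a simplified sketch, not the exact diff; the key, file, and logger variable names are illustrative):

```java
// Before (sketch): a failure while caching the serialized network was swallowed,
// so the network was still returned but the storage error only appeared in server logs.
try {
    fileStorage.moveIntoStorage(fileStorageKey, networkFile);
} catch (Exception e) {
    // Tolerate exceptions here as we do have a network to return,
    // we just failed to cache it.
    LOG.error("Error storing serialized network: {}", e.toString());
}

// After (sketch): no try/catch, so any failure in moveIntoStorage() propagates
// to the caller and becomes visible during network building and saving.
fileStorage.moveIntoStorage(fileStorageKey, networkFile);
```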
ansoncfit (Member) commented Oct 25, 2024

Thanks for the fix. In local testing, I confirmed that an exception thrown here by a worker makes its way to a sensible error message shown in the analysis panel of the UI.

Because we were just talking about obscure Kryo serialization errors, I was wondering whether there are harmless errors that occur in production. If so, and we want to tolerate them, should we restrict the catch block to KryoException instead of removing the try/catch entirely? What do you think, @abyrd?
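
For reference, the narrower catch might look something like this (a sketch only; the serialization helper and surrounding variable names are illustrative, not the real code):

```java
import com.esotericsoftware.kryo.KryoException;

try {
    // Serialize the network and move the resulting file into storage
    // (these calls stand in for whatever the real try block contains).
    serializeNetworkToFile(network, networkFile);             // illustrative helper
    fileStorage.moveIntoStorage(fileStorageKey, networkFile);
} catch (KryoException e) {
    // Tolerate only Kryo serialization problems: the freshly built network
    // is still usable in memory, we just failed to cache the serialized copy.
    LOG.error("Kryo problem while caching serialized network: {}", e.toString());
}
// Any other exception (e.g. an S3 transfer failure) now propagates to the caller.
```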

abyrd (Member) commented Oct 28, 2024

I think that tolerating exceptions here was intentional, as indicated in the comment "Tolerate exceptions here as we do have a network to return, we just failed to cache it."

We may have been encountering exceptions when copying files to S3, for example, which could even have been due to problems with AWS that were out of our control. I think this used to happen fairly regularly, for reasons we could never pin down. In such a situation, the machine would have been able to continue doing calculations but just couldn't cache the network file for reuse by other machines in the future.

If we're no longer experiencing regular inexplicable S3 transfer problems, then maybe we don't need to tolerate any exceptions here and it's better to fail fast and very visibly. I'll give it some more thought tomorrow and provide a review.

abyrd (Member) commented Nov 1, 2024

Looking at this again, my sense is that priorities are just different when testing a new feature vs. normal usage. In testing, it is more convenient if the error is immediately visible and the calling code fails fast and aborts. However, in production it's more convenient if arcane details like network or other cloud-provider glitches do not prevent the user from making use of the TransportNetwork that was just successfully built, potentially after a long wait.

If the problem was most recently encountered in testing, then the fail-fast behavior might seem better. But do we want end users to experience that behavior?

Do we have a way to establish whether S3 transfer problems have become rare or nonexistent? If they have, the question can be sidestepped: we can switch to the new fail-fast behavior without, in practice, imposing failures on end users.

Probably the ideal is for all methods to employ a more sophisticated return type that can accumulate errors and warnings alongside any return value, so no part of the system has to choose between returning a value and failing (they can always do both at once). But that's a major change that would provide benefit only if applied broadly throughout the system.
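
For what it's worth, a minimal version of that idea might look like the sketch below (ResultWithMessages is a hypothetical type, not something that exists in r5):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a result type that carries a value together with accumulated
// errors and warnings, so callers never have to choose between returning
// a value and reporting a failure.
public class ResultWithMessages<T> {
    public final T value;                        // may still be usable even when errors occurred
    public final List<String> errors = new ArrayList<>();
    public final List<String> warnings = new ArrayList<>();

    public ResultWithMessages (T value) {
        this.value = value;
    }
}

// Usage sketch: return the freshly built network, but record the caching failure.
// ResultWithMessages<TransportNetwork> result = new ResultWithMessages<>(network);
// try {
//     fileStorage.moveIntoStorage(fileStorageKey, networkFile);
// } catch (Exception e) {
//     result.warnings.add("Failed to cache serialized network: " + e);
// }
// return result;
```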

ansoncfit (Member)

Some S3 download problems may still be lurking (such as #832), but I don't recall specific instances of upload problems.

I just thought of another option: tolerate the exceptions but log them to Slack.
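
That could be as simple as keeping the catch block but forwarding the exception to a reporting channel (sketch only; slackNotifier is a hypothetical component, not an existing one):

```java
try {
    fileStorage.moveIntoStorage(fileStorageKey, networkFile);
} catch (Exception e) {
    // Still tolerate the failure (the built network is returned as before),
    // but surface it to the team instead of burying it in server logs.
    LOG.error("Error storing serialized network", e);
    slackNotifier.send("Failed to store serialized network: " + e);
}
```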

trevorgerhardt (Member, Author)

If AWS S3 uploads are failing, that seems like a serious issue for our production system that we should be aware of, whatever the cause.

Logging to Slack is certainly an option, but do we handle any other exceptions or errors in that way?
