Do not tolerate exceptions while storing the serialized network #946
base: dev
Conversation
Remove a try/catch block that prevented exceptions from bubbling up to the user during network building and saving.
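To make the change concrete, here is a minimal sketch of the pattern before and after this PR. The method and class names (buildNetwork, cacheNetwork, CacheTolerance) are hypothetical stand-ins, not the project's actual API; the point is only the control flow: before, a caching failure was swallowed and the network returned anyway; after, the exception aborts the call and reaches the user.

```java
// Hypothetical sketch: tolerant vs. fail-fast handling of a cache failure.
public class CacheTolerance {

    static String buildNetwork() {
        return "network"; // stands in for a successfully built TransportNetwork
    }

    // Simulated cache step; 'failCache' stands in for e.g. an S3 upload error.
    static void cacheNetwork(String network, boolean failCache) {
        if (failCache) throw new RuntimeException("S3 upload failed");
    }

    // Before this PR: tolerate the exception, log it, and return the network anyway.
    static String buildTolerant(boolean failCache) {
        String network = buildNetwork();
        try {
            cacheNetwork(network, failCache);
        } catch (Exception e) {
            // Tolerated: we do have a network to return, we just failed to cache it.
            System.err.println("Failed to cache network: " + e.getMessage());
        }
        return network;
    }

    // After this PR: no try/catch, so the exception bubbles up to the caller.
    static String buildFailFast(boolean failCache) {
        String network = buildNetwork();
        cacheNetwork(network, failCache); // a cache failure now aborts the build
        return network;
    }

    public static void main(String[] args) {
        System.out.println(buildTolerant(true)); // returns the network despite the failure
        try {
            buildFailFast(true);
        } catch (RuntimeException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

The trade-off discussed below is exactly the difference between these two methods: buildTolerant favors production users who already waited for a long build, buildFailFast favors visibility during testing.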
Thanks for the fix. In local testing, I confirmed that an exception thrown here by a worker makes its way to a sensible error message shown in the analysis panel of the UI. Because we were just talking about obscure Kryo serialization errors, I was wondering if there are harmless errors that occur in production. If so, and we want to tolerate them, should we restrict the catch block to …
I think that tolerating exceptions here was intentional, as indicated in the comment "Tolerate exceptions here as we do have a network to return, we just failed to cache it." We may have been encountering exceptions when copying files to S3, for example, which could even be due to problems with AWS that were out of our control. I think this used to happen fairly regularly for reasons we could never pin down. In such a situation, the machine would have been able to continue doing calculations but just couldn't cache the network file for reuse by other machines in the future. If we're no longer experiencing regular inexplicable S3 transfer problems, then maybe we don't need to tolerate any exceptions here, and it's better to fail fast and very visibly. I'll give it some more thought tomorrow and provide a review.
Looking at this again, my sense is that priorities are just different when testing a new feature vs. normal usage. It is more convenient in testing if the error is immediately visible and fails fast, aborting the calling code. However, in production it's more convenient if arcane details like network or other cloud provider glitches do not prevent the user from making use of the TransportNetwork that was just successfully built, potentially after a long wait. If the problem was most recently encountered in testing, then the fail-fast behavior might seem better. But do we want end users to experience that behavior? Do we have a way to establish whether S3 transfer problems have become rare or nonexistent? If so, then the question can be sidestepped: we can switch to the new fail-fast behavior without imposing it on end users. Probably the ideal is for all methods to employ a more sophisticated return type that can accumulate errors and warnings alongside any return value, so no part of the system has to choose between returning a value and failing (they can always do both at once). But that's a major change that would provide benefit only if applied broadly throughout the system.
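The "value plus accumulated errors" return type mentioned above could be sketched as follows. This is a minimal illustration, not a proposal for the actual class; the name ResultWithWarnings and the method buildAndCache are hypothetical, and a real version would likely carry structured error objects rather than strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a return type that carries a value together with any
// non-fatal problems encountered, so callers never have to choose between
// returning a result and reporting a failure.
public class ResultWithWarnings<T> {

    public final T value; // the successfully produced result, e.g. a network
    public final List<String> warnings = new ArrayList<>();

    public ResultWithWarnings(T value) {
        this.value = value;
    }

    // Record a non-fatal problem without discarding the value.
    public ResultWithWarnings<T> warn(String message) {
        warnings.add(message);
        return this;
    }

    public boolean hasWarnings() {
        return !warnings.isEmpty();
    }

    // Example usage: the build succeeds, caching fails, and both facts
    // are reported to the caller instead of one masking the other.
    static ResultWithWarnings<String> buildAndCache(boolean cacheFails) {
        ResultWithWarnings<String> result = new ResultWithWarnings<>("network");
        if (cacheFails) {
            result.warn("Failed to cache network to S3");
        }
        return result;
    }

    public static void main(String[] args) {
        ResultWithWarnings<String> r = buildAndCache(true);
        System.out.println(r.value + " with warnings " + r.warnings);
    }
}
```

With a type like this, the UI could show the built network immediately while still surfacing the caching problem as a warning, which is the "do both at once" behavior described above.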
Some S3 download problems may still be lurking (such as #832), but I don't recall specific instances of other upload problems. I just thought of another option: tolerate the exceptions but log them to Slack. |
If AWS S3 uploads are failing, that seems like a serious issue for our production system that we should be aware of, whatever the cause might be. Logging to Slack is certainly an option, but do we handle any other exceptions or errors that way?
fileStorage.moveIntoStorage() failed recently, and the reason for the failure was hidden until we logged into the server and checked the logs. This PR removes the try/catch block in order to allow exceptions here to bubble up to the user during network building and saving.