Jumpstart failing in CI runs #1119

Open
HCastano opened this issue Oct 18, 2024 · 10 comments

@HCastano
Collaborator

Looks like the jumpstart is failing periodically for some of our tests. For example, see this CI run.

This is probably because these tests rely on timeouts. It would be nice to get this behaviour to be more consistent.

HCastano added the "Fix something that's not working as intended" label Oct 18, 2024
@frankiebee
Contributor

frankiebee commented Oct 18, 2024

Jump start is failing consistently in the SDK CI because it hits our timeout, and we have a 6-minute timeout.

@frankiebee
Contributor

We seem to have gotten it passing now, all of a sudden, with no explanation as to why, when we added our exit call (our tests weren't exiting when jump start fails), but that's unrelated 🤷

@mixmix
Contributor

mixmix commented Oct 18, 2024

I want to know if it's intermittent. Jump start failing unpredictably will massively hamper our ability to work (and to approve RCs).

@HCastano
Collaborator Author

@frankiebee @mixmix while I'm not ruling out that this could be a problem with the jump start itself, for this specific issue it's probably more related to the Rust test success/failure criteria.

> Jump start is failing consistently in the SDK CI because it hits our timeout, and we have a 6-minute timeout.

I would try and shorten the retry time there. You shouldn't need to wait more than maybe a minute for the jumpstart to happen.

@JesseAbram
Member

JesseAbram commented Oct 18, 2024

Hmmm, interesting. Will investigate more next week. For what it's worth, on the JS side, to mitigate until the root cause is found, you can retry after 50 blocks if it doesn't work.

@frankiebee
Contributor

We are retrying every 10 blocks. Watching logs, it seems that when jump start succeeds locally it does so within a 3-block period.

@frankiebee
Contributor

Unless there's a reason we should try for 50? I'm seeing that when it gets retried at 50 blocks it works.

@JesseAbram
Member

50 blocks is the default time we allow before a retry can happen

@frankiebee
Contributor

Got it.
Note: the SDK timeout is now based on blocks, and I'm retrying every 50 blocks, up to 200 blocks from the first try (see the sketch after the time reports below).
Note: from the SDK tests it looks like 3 out of 8 tries to jumpstart succeed the first go-around; the other 5 have to retry.
The longest I've seen a jump start take to succeed is 103 blocks on CI (that's two retries). Locally our tests are pretty consistent, with jump start taking about 2 to 3 blocks (more or less 20 to 30 seconds).

@HCastano

> I would try and shorten the retry time there. You shouldn't need to wait more than maybe a minute for the jumpstart to happen.

That may be true on our local machines, but our CI more often than not is hitting the 6-minute mark.
Some time reports taken from our tests:

2024-10-19T01:28:48.0179808Z     final report:jump-start
2024-10-19T01:28:48.0180505Z     total-time: 336 seconds
2024-10-19T01:28:48.0181153Z     total-block-time: 55 blocks

2024-10-19T01:31:42.0294668Z     final report:jump-start
2024-10-19T01:31:42.0297851Z     total-time: 41 seconds
2024-10-19T01:31:42.0300691Z     total-block-time: 6 blocks

2024-10-19T01:42:42.0172770Z     final report:jump-start
2024-10-19T01:42:42.0173823Z     total-time: 612 seconds
2024-10-19T01:42:42.0174961Z     total-block-time: 103 blocks

2024-10-19T01:44:06.0200925Z     final report:jump-start
2024-10-19T01:44:06.0201906Z     total-time: 27 seconds
2024-10-19T01:44:06.0203046Z     total-block-time: 4 blocks

2024-10-19T01:50:24.0162465Z     final report:jump-start
2024-10-19T01:50:24.0163744Z     total-time: 319 seconds
2024-10-19T01:50:24.0164768Z     total-block-time: 53 blocks

2024-10-19T01:56:42.0170359Z     final report:jump-start
2024-10-19T01:56:42.0171174Z     total-time: 327 seconds
2024-10-19T01:56:42.0172143Z     total-block-time: 54 blocks

2024-10-19T02:02:54.0237231Z     final report:jump-start
2024-10-19T02:02:54.0238198Z     total-time: 322 seconds
2024-10-19T02:02:54.0238731Z     total-block-time: 53 blocks

2024-10-19T02:04:30.0186155Z     final report:jump-start
2024-10-19T02:04:30.0187472Z     total-time: 23 seconds
2024-10-19T02:04:30.0188396Z     total-block-time: 3 blocks
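
For illustration, here is a minimal TypeScript sketch of the block-based retry described above: attempt the jumpstart, retry every 50 blocks, and give up 200 blocks after the first attempt. The `JumpStartDeps` interface, function names, and poll interval are assumptions made for the sketch, not the actual entropy SDK API.

```ts
// Hypothetical dependencies standing in for the real SDK calls; the actual
// entropy SDK surface is not shown in this thread, so these are assumptions.
interface JumpStartDeps {
  currentBlock: () => Promise<number>;      // read the latest block number
  attemptJumpStart: () => Promise<boolean>; // true once the network reports jumpstart success
}

// Retry the jumpstart every `retryEveryBlocks` blocks, giving up once
// `maxBlocks` blocks have elapsed since the first attempt.
async function jumpStartWithBlockRetries(
  deps: JumpStartDeps,
  retryEveryBlocks = 50,
  maxBlocks = 200,
  pollMs = 6_000, // roughly one block at a 6-second block time (assumption)
): Promise<void> {
  const startBlock = await deps.currentBlock();
  let lastAttemptBlock = startBlock;

  if (await deps.attemptJumpStart()) return;

  while (true) {
    const now = await deps.currentBlock();
    if (now - startBlock >= maxBlocks) {
      throw new Error(`jumpstart did not complete within ${maxBlocks} blocks`);
    }
    if (now - lastAttemptBlock >= retryEveryBlocks) {
      lastAttemptBlock = now;
      if (await deps.attemptJumpStart()) return;
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}
```

Counting blocks rather than wall-clock time keeps the retry budget meaningful even when the CI machines fall behind on block production.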

@JesseAbram
Member

JesseAbram commented Oct 19, 2024

Mmmmm, this is interesting. With all the evidence here I'm leaning towards the CI nodes not being able to keep up with the blocks:

  • Would explain the random failures
  • Only appears in CI
  • Only happened when we went from fewer nodes to more (2 -> 4 in the SDK, and 1 -> 4 in core)
  • Happens less in core because it only relies on the Alice chain message, since we mock the other ones

This is actually a lot of evidence. I'll try to think of ways to test this and talk to Vi about upping the CI machine on Monday.

It could possibly be something in our codebase, but I'm leaning towards that being less likely given the overwhelming evidence.
