Jumpstart failing in CI runs #1119

Open
HCastano opened this issue Oct 18, 2024 · 10 comments

@HCastano
Collaborator

Looks like the jumpstart is failing periodically for some of our tests. For example, see this CI run.

This is probably because these tests rely on timeouts. It would be nice to get this behaviour to be more consistent.

HCastano added the "Fix something that's not working as intended" label Oct 18, 2024
@frankiebee
Contributor

frankiebee commented Oct 18, 2024

Jump start is failing consistently in the SDK CI because it hits our timeout, and we have a 6-minute timeout.

@frankiebee
Contributor

We seem to have gotten it passing now, all of a sudden, with no explanation as to why, when we added our exit call (our tests weren't exiting when jump start fails), but that's unrelated 🤷

@mixmix
Contributor

mixmix commented Oct 18, 2024

I want to know if it's intermittent. Jump start failing unpredictably will massively hamper our ability to work (and to approve RCs).

@HCastano
Collaborator Author

@frankiebee @mixmix while I'm not ruling out that this could be a problem with the jump start itself, for this specific issue it's probably more related to the Rust test success/failure criteria.

> Jump start is failing consistently in the SDK CI because it hits our timeout, and we have a 6-minute timeout.

I would try and shorten the retry time there. You shouldn't need to wait more than maybe a minute for the jumpstart to happen.

@JesseAbram
Member

JesseAbram commented Oct 18, 2024

Hmmm, interesting. Will investigate more next week. For what it's worth, on the JS side, to mitigate until the root cause is found, you can retry after 50 blocks if it doesn't work.

@frankiebee
Contributor

We are retrying every 10 blocks. Watching logs, it seems that when jump start succeeds locally it does so within a 3-block period.

@frankiebee
Contributor

Unless there's a reason we should try for 50? I'm seeing that when it gets retried at 50 blocks it works.

@JesseAbram
Member

50 blocks is the default time we allow before a retry can happen

@frankiebee
Contributor

Got it.
Note: the SDK timeout is now based on blocks, and I'm retrying every 50 blocks, up to 200 blocks from the first try (see the sketch after the time reports below).
Note: from the SDK tests it looks like 3 out of 8 tries to jumpstart succeed the first go-around; the other 5 have to retry.
The longest I've seen a jump start take to succeed is 103 blocks on CI (that's two retries). Locally our tests are pretty consistent, with jump start taking about 2 to 3 blocks (more or less 20 to 30 seconds).

@HCastano

> I would try and shorten the retry time there. You shouldn't need to wait more than maybe a minute for the jumpstart to happen.

That may be true on our local machines, but our CI more often than not is hitting the 6-minute mark.
Some time reports taken from our tests:

2024-10-19T01:28:48.0179808Z     final report:jump-start
2024-10-19T01:28:48.0180505Z     total-time: 336 seconds
2024-10-19T01:28:48.0181153Z     total-block-time: 55 blocks

2024-10-19T01:31:42.0294668Z     final report:jump-start
2024-10-19T01:31:42.0297851Z     total-time: 41 seconds
2024-10-19T01:31:42.0300691Z     total-block-time: 6 blocks

2024-10-19T01:42:42.0172770Z     final report:jump-start
2024-10-19T01:42:42.0173823Z     total-time: 612 seconds
2024-10-19T01:42:42.0174961Z     total-block-time: 103 blocks

2024-10-19T01:44:06.0200925Z     final report:jump-start
2024-10-19T01:44:06.0201906Z     total-time: 27 seconds
2024-10-19T01:44:06.0203046Z     total-block-time: 4 blocks

2024-10-19T01:50:24.0162465Z     final report:jump-start
2024-10-19T01:50:24.0163744Z     total-time: 319 seconds
2024-10-19T01:50:24.0164768Z     total-block-time: 53 blocks

2024-10-19T01:56:42.0170359Z     final report:jump-start
2024-10-19T01:56:42.0171174Z     total-time: 327 seconds
2024-10-19T01:56:42.0172143Z     total-block-time: 54 blocks

2024-10-19T02:02:54.0237231Z     final report:jump-start
2024-10-19T02:02:54.0238198Z     total-time: 322 seconds
2024-10-19T02:02:54.0238731Z     total-block-time: 53 blocks

2024-10-19T02:04:30.0186155Z     final report:jump-start
2024-10-19T02:04:30.0187472Z     total-time: 23 seconds
2024-10-19T02:04:30.0188396Z     total-block-time: 3 blocks
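
For illustration, here is a minimal TypeScript sketch of the block-based retry described above: attempt the jumpstart, retry every 50 blocks, and give up 200 blocks after the first attempt. The `JumpStartDeps` interface, function names, and poll interval are assumptions made for the sketch, not the actual entropy SDK API.

```ts
// Hypothetical dependencies standing in for the real SDK calls; the actual
// entropy SDK surface is not shown in this thread, so these are assumptions.
interface JumpStartDeps {
  currentBlock: () => Promise<number>;      // read the latest block number
  attemptJumpStart: () => Promise<boolean>; // true once the network reports jumpstart success
}

// Retry the jumpstart every `retryEveryBlocks` blocks, giving up once
// `maxBlocks` blocks have elapsed since the first attempt.
async function jumpStartWithBlockRetries(
  deps: JumpStartDeps,
  retryEveryBlocks = 50,
  maxBlocks = 200,
  pollMs = 6_000, // roughly one block at a 6-second block time (assumption)
): Promise<void> {
  const startBlock = await deps.currentBlock();
  let lastAttemptBlock = startBlock;

  if (await deps.attemptJumpStart()) return;

  while (true) {
    const now = await deps.currentBlock();
    if (now - startBlock >= maxBlocks) {
      throw new Error(`jumpstart did not complete within ${maxBlocks} blocks`);
    }
    if (now - lastAttemptBlock >= retryEveryBlocks) {
      lastAttemptBlock = now;
      if (await deps.attemptJumpStart()) return;
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}
```

Counting blocks rather than wall-clock time keeps the retry budget meaningful even when the CI machines fall behind on block production.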

@JesseAbram
Member

JesseAbram commented Oct 19, 2024

Mmmmm, this is interesting. With all the evidence here I'm leaning towards the CI nodes not being able to keep up with the blocks:

  • Would explain the random failures
  • Only appears in CI
  • Only happened when we went from fewer nodes to more (2 -> 4 in the SDK, and 1 -> 4 in core)
  • Happens less in core because it only relies on the Alice chain message, since we mock the other ones

This is actually a lot of evidence. I'll try to think of ways to test this and talk to Vi about upping the CI machine on Monday.

It could possibly be something in our codebase, but I'm leaning towards that being less likely given the overwhelming evidence.
