
Fix iperf no connection and PerfResult warmup timestamp calculation #382

Merged: 3 commits merged into LNST-project:master, Oct 21, 2024

Conversation

@olichtne (Collaborator) commented Oct 3, 2024

Description

This should fix the issues we regularly see with lnst crashing with an exception when iperf fails to establish a connection. The source of the issue is that the created perf result data structure was slightly different in that case.

While testing this I also found issues with the warmup/warmdown timestamp calculation, because the created zero iperf measurements don't have a duration long enough to include a warmup/warmdown... this however shouldn't crash the trimming and should instead return another zero measurement.
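
To make the intended trimming behaviour concrete, here is a small illustrative sketch in plain Python; the function name and numbers are hypothetical and this is not lnst code:

def trim_measurement(value, duration, warmup, warmdown):
    """Cut the warmup/warmdown portions off an interval's duration."""
    usable = duration - warmup - warmdown
    if usable <= 0:
        # a zero/failed measurement is too short to even contain the
        # warmup and warmdown, so return another zero measurement
        # instead of raising
        return 0, 0
    # assume the measured value is spread evenly across the interval
    return value * usable / duration, usable

print(trim_measurement(1_000_000, 60, 3, 3))  # normal run: (900000.0, 54)
print(trim_measurement(0, 1, 3, 3))           # failed run: (0, 0), no crash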

Tests

(Please provide a list of tests that prove that the pull request doesn't break the stable state of the master branch. This should include test runs with valid results for all critical workflows.)

Reviews

@jtluka @enhaut @Axonis

Closes: #

When the iperf job fails we create a slightly different PerfResult data
structure, which can create problems in other parts of the code that
assume a stable structure.

This fixes that by adding one additional level of a
SequentialPerfResult.

Signed-off-by: Ondrej Lichtner <olichtne@redhat.com>
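
For illustration, a minimal sketch of what this commit describes: the zero interval produced on failure gets wrapped in an extra SequentialPerfResult so the failure path has the same nesting as a successful measurement. The lnst class names are real, but the PerfInterval constructor signature and the exact nesting order shown here are assumptions, not a copy of the patch:

from lnst.RecipeCommon.Perf.Results import (
    ParallelPerfResult,
    PerfInterval,
    SequentialPerfResult,
)

def zero_flow_result(timestamp):
    # one "stream" of intervals -- here a single zero interval; the
    # PerfInterval(value, duration, unit, timestamp) signature is assumed
    stream = SequentialPerfResult()
    stream.append(PerfInterval(0, 1, "bits", timestamp))

    # the extra SequentialPerfResult level keeps the failed-measurement
    # structure shaped like a successful multi-stream measurement
    result = ParallelPerfResult()
    result.append(stream)
    return result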
When creating these PerfIntervals representing zero flows we need to set
the duration to non-zero to avoid division-by-zero issues.

Signed-off-by: Ondrej Lichtner <olichtne@redhat.com>
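
A short sketch (plain Python, not lnst code) of why a zero duration is a problem: per-second averages divide the measured value by the interval duration.

def average(value, duration):
    # throughput-style average: measured value per second of the interval
    return value / duration

try:
    average(0, 0)      # zero flow with a zero duration
except ZeroDivisionError:
    print("zero duration breaks the average calculation")

print(average(0, 1))   # 0.0 -- zero flow with a 1s duration works fine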
@olichtne (Collaborator, Author) commented Oct 3, 2024

This will need some testing on real data to see if the warmup/warmdown trimming isn't causing issues...

I tested this locally by printing the difference between the old calculation and the new one and the difference was minimal (0.002s), so IMO this shouldn't be a problem, but we want to be sure...

The original implementation comes from #248 so that will be relevant for testing this

@jtluka (Collaborator) left a comment


Besides my concern about the warmup/warmdown timestamps, this looks ok.

@jtluka (Collaborator) commented Oct 8, 2024

> This will need some testing on real data to see if the warmup/warmdown trimming isn't causing issues...
>
> I tested this locally by printing the difference between the old calculation and the new one and the difference was minimal (0.002s), so IMO this shouldn't be a problem, but we want to be sure...
>
> The original implementation comes from #248 so that will be relevant for testing this

Did you test this on any parallel iperf test we have internally? E.g. 35469f23-2f89-4b8a-a40d-d29395f71fdf (96 parallel flows)

@olichtne (Collaborator, Author) commented Oct 8, 2024

> > This will need some testing on real data to see if the warmup/warmdown trimming isn't causing issues...
> >
> > I tested this locally by printing the difference between the old calculation and the new one and the difference was minimal (0.002s), so IMO this shouldn't be a problem, but we want to be sure...
> >
> > The original implementation comes from #248 so that will be relevant for testing this
>
> Did you test this on any parallel iperf test we have internally? E.g. 35469f23-2f89-4b8a-a40d-d29395f71fdf (96 parallel flows)

This is an ARM test which I can't run at the moment...

Instead I'll schedule a job with all tests that have perf_parallel_processes > 1 on RHEL9 that are in the kernel-full group and see what happens...

@olichtne (Collaborator, Author) commented Oct 8, 2024

test job scheduled: Submitted: ['J:9976969']

Edit: made a mistake in scheduling... new job: Submitted: ['J:9977354']
Submitted: ['J:9977534']

@jtluka previously approved these changes Oct 8, 2024
@olichtne (Collaborator, Author) commented Oct 17, 2024

Scheduled a test for the other solution to check the difference: J:10037824

@olichtne (Collaborator, Author) commented:

Compared the "Option A" test run vs "Option B" test run here:
a_vs_b.txt

There are a couple of tests where different results are seen, however IMO these are mostly due to the unstable nature of multistream performance tests that we usually see anyway.

Both test runs evaluate as OK when compared to historical baselines.

So IMO both implementations are basically equivalent.

As one more step, I'll run a multistream test with 64 threads locally and print out the warmup_end calculation to compare option A, option B and the old original implementation, and show the results here.

@olichtne (Collaborator, Author) commented:

So here is a local comparison from running a test in containers.

The warmup_end code I used:

    @property
    def warmup_end(self):
        old = max(
            [
                parallel[self.warmup_duration - 1].end_timestamp
                for parallel in (
                    *self.generator_results,
                    *self.receiver_results,
                )
            ])
        optionA = self.start_timestamp + self.warmup_duration
        optionB = max(
                [
                    self.generator_results.start_timestamp,
                    self.generator_cpu_stats.start_timestamp,
                    self.receiver_results.start_timestamp,
                    self.receiver_cpu_stats.start_timestamp,
                ]
            ) + self.warmup_duration

        import logging
        logging.error(f"AAAAAAAAAAA old: {old}")
        logging.error(f"AAAAAAAAAAA A: {optionA}")
        logging.error(f"AAAAAAAAAAA B: {optionB}")
        return (
            max(
                [
                    self.generator_results.start_timestamp,
                    self.generator_cpu_stats.start_timestamp,
                    self.receiver_results.start_timestamp,
                    self.receiver_cpu_stats.start_timestamp,
                ]
            )
            + self.warmup_duration
        )

The output I got:

-> % rg 'AAAAAAAAAAA old' -A2 Logs
Logs/2024-10-18_14:00:06/SimpleNetworkRecipe_match_0/info
1465:2024-10-18 14:08:48       (localhost)        -   ERROR: AAAAAAAAAAA old: 1729252874.167374
1466-2024-10-18 14:08:48       (localhost)        -   ERROR: AAAAAAAAAAA A: 1729252874
1467-2024-10-18 14:08:48       (localhost)        -   ERROR: AAAAAAAAAAA B: 1729252874
1468:2024-10-18 14:08:48       (localhost)        -   ERROR: AAAAAAAAAAA old: 1729252971.003915
1469-2024-10-18 14:08:48       (localhost)        -   ERROR: AAAAAAAAAAA A: 1729252971
1470-2024-10-18 14:08:48       (localhost)        -   ERROR: AAAAAAAAAAA B: 1729252971
1471:2024-10-18 14:08:48       (localhost)        -   ERROR: AAAAAAAAAAA old: 1729253068.265373
1472-2024-10-18 14:08:48       (localhost)        -   ERROR: AAAAAAAAAAA A: 1729253068
1473-2024-10-18 14:08:48       (localhost)        -   ERROR: AAAAAAAAAAA B: 1729253068
1474:2024-10-18 14:08:48       (localhost)        -   ERROR: AAAAAAAAAAA old: 1729253163.374155
1475-2024-10-18 14:08:48       (localhost)        -   ERROR: AAAAAAAAAAA A: 1729253163
1476-2024-10-18 14:08:48       (localhost)        -   ERROR: AAAAAAAAAAA B: 1729253163
1477:2024-10-18 14:08:48       (localhost)        -   ERROR: AAAAAAAAAAA old: 1729253262.3002691
1478-2024-10-18 14:08:48       (localhost)        -   ERROR: AAAAAAAAAAA A: 1729253262
1479-2024-10-18 14:08:48       (localhost)        -   ERROR: AAAAAAAAAAA B: 1729253262

We can see that optionA and optionB are identical and they're always basically at most 0.3s off from the old method.

The recipe I used for this locally had this configuration:

recipe = SimpleNetworkRecipe(
    perf_parallel_processes=64,
    offload_combinations=[{'gro': 'on', 'gso': 'on', 'tso': 'on', 'tx': 'on', 'rx': 'on'}],
    perf_duration=60,
    ip_versions=['ipv4'],
    perf_tests=['tcp_stream'],
    perf_msg_sizes=[16384],
    minimal_idlestates_latency=0,
    drop_caches=True,
    perf_warmup_duration=3,
    disable_turboboost=True,
)

And it is worth noting that the startup of the iperf clients took a while, as the difference between the first and last client is sizable:

2024-10-18 14:07:09       (localhost)        -   DEBUG: Host host2 executing job 271: type(module), host(host2), netns(None), IperfServer(bind=Ip4Address(192.168.101.2/24), oneoff=True, port=12000)
2024-10-18 14:07:09       (localhost)        -   DEBUG: Job.what = IperfServer(bind=Ip4Address(192.168.101.2/24), oneoff=True, port=12000)
2024-10-18 14:07:09       (localhost)        -    INFO: Result: PASS, What: Job started: type(module), host(host2), netns(None), IperfServer(bind=Ip4Address(192.168.101.2/24), oneoff=True, port=12000)

...

2024-10-18 14:07:14       (localhost)        -   DEBUG: Host host2 executing job 334: type(module), host(host2), netns(None), IperfServer(bind=Ip4Address(192.168.101.2/24), oneoff=True, port=12063)
2024-10-18 14:07:14       (localhost)        -   DEBUG: Job.what = IperfServer(bind=Ip4Address(192.168.101.2/24), oneoff=True, port=12063)
2024-10-18 14:07:14       (localhost)        -    INFO: Result: PASS, What: Job started: type(module), host(host2), netns(None), IperfServer(bind=Ip4Address(192.168.101.2/24), oneoff=True, port=12063)
2024-10-18 14:07:14           (host2)        -   DEBUG: Running job 334 with pid "1010"
2024-10-18 14:07:14           (host2)        -   DEBUG: compiled command:  iperf3 -s -B 192.168.101.2 -J -p 12063 -1
2024-10-18 14:07:14           (host2)        -   DEBUG: running as server ...
2024-10-18 14:07:16       (localhost)        -   DEBUG: Host host1 executing job 272: type(module), host(host1), netns(None), IperfClient(server=Ip4Address(192.168.101.2/24), duration=60, warmup_duration=3, blksize=16384, port=12000, client_port=12000)
2024-10-18 14:07:16       (localhost)        -   DEBUG: Job.what = IperfClient(server=Ip4Address(192.168.101.2/24), duration=60, warmup_duration=3, blksize=16384, port=12000, client_port=12000)
2024-10-18 14:07:16       (localhost)        -    INFO: Result: PASS, What: Job started: type(module), host(host1), netns(None), IperfClient(server=Ip4Address(192.168.101.2/24), duration=60, warmup_duration=3, blksize=16384, port=12000, client_port=12000)

...

2024-10-18 14:07:39       (localhost)        -    INFO: Result: PASS, What: Job started: type(module), host(host1), netns(None), IperfClient(server=Ip4Address(192.168.101.2/24), duration=60, warmup_duration=3, blksize=16384, port=12063, client_port=12063)
2024-10-18 14:07:39           (host1)        -   DEBUG: Running job 335 with pid "1013"
2024-10-18 14:07:39       (localhost)        -   DEBUG: Waiting for Job 272 on Host host1 for 71 seconds.
2024-10-18 14:07:39           (host1)        -   DEBUG: Measuring for 66 seconds (perf_duration + perf_warmup_duration * 2).
2024-10-18 14:07:39           (host1)        -   DEBUG: compiled command:  iperf3 -c 192.168.101.2 -b 0/1000 -J -t 66   -l 16384  -p 12063 --cport 12063
2024-10-18 14:07:39           (host1)        -   DEBUG: running as client ...

@olichtne (Collaborator, Author) commented:

Did the same with warmdown_start for completeness:

    @property
    def warmdown_start(self):
        old = min(
            [
                parallel[-self.warmup_duration].start_timestamp
                for parallel in (
                    *self.generator_results,
                    *self.receiver_results,
                )
            ])
        optionA = self.end_timestamp - self.warmup_duration
        optionB = min(
                [
                    self.generator_results.end_timestamp,
                    self.generator_cpu_stats.end_timestamp,
                    self.receiver_results.end_timestamp,
                    self.receiver_cpu_stats.end_timestamp,
                ]
            ) - self.warmup_duration

        import logging
        logging.error(f"BBBBBBBBBBB old: {old}")
        logging.error(f"BBBBBBBBBBB A: {optionA}")
        logging.error(f"BBBBBBBBBBB B: {optionB}")
        return (
            min(
                [
                    self.generator_results.end_timestamp,
                    self.generator_cpu_stats.end_timestamp,
                    self.receiver_results.end_timestamp,
                    self.receiver_cpu_stats.end_timestamp,
                ]
            )
            - self.warmup_duration
        )

The logs:

Logs/2024-10-18_14:30:00/SimpleNetworkRecipe_match_0/debug
6902:2024-10-18 14:38:45       (localhost)        -   ERROR: BBBBBBBBBBB old: 1729254727.004715
6903-2024-10-18 14:38:45       (localhost)        -   ERROR: BBBBBBBBBBB A: 1729254727.135558
6904-2024-10-18 14:38:45       (localhost)        -   ERROR: BBBBBBBBBBB B: 1729254727
--
6909:2024-10-18 14:38:45       (localhost)        -   ERROR: BBBBBBBBBBB old: 1729254829.009228
6910-2024-10-18 14:38:45       (localhost)        -   ERROR: BBBBBBBBBBB A: 1729254829.005448
6911-2024-10-18 14:38:45       (localhost)        -   ERROR: BBBBBBBBBBB B: 1729254829
--
6916:2024-10-18 14:38:45       (localhost)        -   ERROR: BBBBBBBBBBB old: 1729254930.013833
6917-2024-10-18 14:38:45       (localhost)        -   ERROR: BBBBBBBBBBB A: 1729254930.123065
6918-2024-10-18 14:38:45       (localhost)        -   ERROR: BBBBBBBBBBB B: 1729254930
--
6923:2024-10-18 14:38:45       (localhost)        -   ERROR: BBBBBBBBBBB old: 1729255027.000226
6924-2024-10-18 14:38:45       (localhost)        -   ERROR: BBBBBBBBBBB A: 1729255027.104802
6925-2024-10-18 14:38:45       (localhost)        -   ERROR: BBBBBBBBBBB B: 1729255027
--
6930:2024-10-18 14:38:45       (localhost)        -   ERROR: BBBBBBBBBBB old: 1729255120.022853
6931-2024-10-18 14:38:45       (localhost)        -   ERROR: BBBBBBBBBBB A: 1729255120.036433
6932-2024-10-18 14:38:45       (localhost)        -   ERROR: BBBBBBBBBBB B: 1729255120

Here we see some small differences between A and B, but when comparing A to old or B to old, IMO the difference is irrelevant.

@olichtne (Collaborator, Author) commented:

With the new data and testing I believe the simpler option A implementation is better and there should be no real impact on any of our testing.

@jtluka (Collaborator) left a comment


Besides my only comment, this looks ok.

This index-based calculation is inaccurate if the data structure doesn't
reflect a single very specific use case... instead we can simply
calculate the timestamps by addition...

Signed-off-by: Ondrej Lichtner <olichtne@redhat.com>
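
For reference, a minimal sketch of the addition-based approach ("option A" above); the surrounding class is hypothetical scaffolding, and only the two property bodies mirror the snippets posted earlier in this thread:

class TrimmedFlowMeasurement:
    def __init__(self, start_timestamp, end_timestamp, warmup_duration):
        self.start_timestamp = start_timestamp
        self.end_timestamp = end_timestamp
        self.warmup_duration = warmup_duration

    @property
    def warmup_end(self):
        # warmup ends warmup_duration seconds after the measurement starts
        return self.start_timestamp + self.warmup_duration

    @property
    def warmdown_start(self):
        # warmdown starts warmup_duration seconds before the measurement ends
        return self.end_timestamp - self.warmup_duration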
@olichtne merged commit a42be24 into LNST-project:master on Oct 21, 2024
3 checks passed