-
-
Notifications
You must be signed in to change notification settings - Fork 439
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inconsistent behaviors of computing CPU% when time "delta" is zero #1274
Comments
FWIW, I suspect this is a case of "perfect is the enemy of good". I cannot see any situation where the delta can be zero ... AFAICS it will not happen in htop, have you observed this? Even if we make two calls to clock_gettime immediately one after the other (which we certainly never do, there's always heaps of time between them in practice), we'll still see a small time difference and small delta value. IMO this issue should just be closed as its just likely to cause code churn - best not to chase our tails on things that can "theoretically happen" when in practice they cannot. |
@natoscott There is not even an assertion for that. And the time source isn't just clock_gettime. There are other time sources where "less than one unit of time difference" is possible between two measures. The first part of the code where I found "less than one unit of time difference" is of concern is the |
@Explorer09 I'm not sure I understand what you mean here - you're saying "there is not even an assertion for that" but in the original post you stated ~10 times "when delta is zero". So, AIUI, there is definitely an assertion that delta will be zero ... am I misunderstanding? In the netlink case where a unit conversion between usec and nsec is needed, it sounds like it would be simpler to just multiply the usec value by 1000 so we're on common units, and still with no possibility of a zero delta. You understand my concern that this is just code churn, right? No actual problem is being solved here AFAICT, so we probably should not be doing this unless it intrinsically makes the code better (more efficient?) in some other way. But I'm just not really seeing a convincing argument for that so far (I'm happy to be convinced though), the rationale seems to be based on "when delta is zero" which as I say, I don't think can actually happen. |
Update: Sorry for not explaining this clear enough (I was sleepy when I wrote the initial version of this report). I mean there are several approaches for the case where the "delta time" is zero:
The goal was to eliminate the division by zero possibly, so that unnecessary |
@Explorer09 of those three, I'd vote for option 1 I guess. There's also option 0, which I also very much support, which is "no changes" since there seems to be no actual issue being addressed here AFAICS. |
I think, – and that's why I asked to split this from the original issue –, that having various platforms behave differently for what should be the same operation (get CPU usage, get Process usage) should be consistent. As listed in the initial post, this currently is not the case. Regarding the question, what the "convergent" behaviour should be in the end, I prefer an approach, where the visible result is sensible and comprehensible to the user. Based on the various implementation mentioned at the top, I think the options for handling, when the delta time is less than some minimal interval, are:
While IMO assuming a minimal interval (of 1ms, which is plenty for htop) is possible, the problem is that values measured on those short timescales are numerically unstable. Thus you are likely to risk showing wildly inaccurate data. Thus a consistent way to go forward here is actually to drop such measurements, which would mean to explicitly mark them as NaN, as no reliable measurement is available. While using zero here could be considered, you would be having the risk of showing conflicting data all over again. Nothing is worse than confidently showing 0% CPU usage while directly below a bunch of processes are shown to be churning away at your CPU at 100% … TL;DR: If the delta time is below a minimal interval of 1ms reject such values and report NaN instead. |
I disagree with the assumption that "there is no issue", because there are other parts of htop code that check if the time "period" is zero, and for the "option 0" as @natoscott mentioned, the ideal would be to remove those checks as well. Commit d1db9da (in Linux code) is one of them that addressed the case that the delta time could be zero, and it's better to address all of such cases consistently. By the way, it's me that split this issue from #1271 because I think this issue deserves more discussion and is not a trivial task to make it with that pull request. As for @BenBE 's comments… And you guys may have noticed I proposed #1275 to reject the zero fixed scale (for BSD platforms). |
But this commit is not dealing with the time delta we've been discussing. Its a weak comparison (ticks vs time) - one is a Linux-specific corner case where we know full well the ticks-based denominator can be zero - whereas here we've been discussing time delta between consecutive clock_gettime(2) samples with a substantial delay, where we always know the time-difference denominator is non-zero. I'm concerned about sprinkling dead code throughout htop, which is what these additional branches would amount to. Usually we remove dead code whenever we find it. |
For me it's the same issue, that htop doesn't handle external data properly. Data gathered from external sources should be assumed untrustworthy, so we are supposed to filter unlikely input, cutting corner cases no matter how unlikely it can happen.
Yes, such branches might look "dead" or very unlikely to happen, and I am trying not to introduce anything substantial to them. Usually it's about setting variables to sane values, or jump to an error branch that already exists (the approach I picked in the PR #1275 ) |
I originally intended to address this issue in pull request #1271 before I found out the issue is too big to handle in a single commit.
The CPU percentages of the machine and processes are usually computed through a formula like this:
(float) (newUsedTime - oldUsedTime) / (newElapsedTime - oldElapsedTime) * 100.0
The denominator represents a difference between the current time and the timestamp of last measurement. However, when this time difference is zero, platform codes of htop handle them differently, and some of them are a bit buggy.
Here is a summary of the behaviors of different htop platform codes:
kernelFScale
) which is not checked for zero.fscale
is zero, the percentage is set to zero.isnan
check.Source code references (I use commit 2d4e5cb from
main
as a base; for convenience this page marks all lines that are referenced):Yes. Platform codes have different behaviors when the time "delta" in the denominator is zero.
The result I wish is all platform codes behave consistently in this case, with explicit checks on whether the time "delta" is zero. The time "delta" being zero is not necessary a bug (it can theoretically happen when the measures are made too fast; faster than the OS can update the process status).
Update: Sorry for not explaining this clear enough (I was sleepy when I wrote the initial version of this report). There are several approaches for the case where the "delta time" is zero:
assert()
error, but skip the check in non-debug builds of htop.The goal was to eliminate the division by zero possibly, so that unnecessary
isnan
checks can be removed (I did #1271 to replaceisnan
with better alternatives). htop uses NaN values mainly for "data not available", "not supported by machine or OS" or "error retrieving data", and not for indicating arithmetic errors (such as division by zero). So the possibility of NaN be slipped from here looks more like a bug to me. By the way, I prefer the approach no. 2 if I can choose.The text was updated successfully, but these errors were encountered: