Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify tracking of highest and lowest trackable values #75

Open
marshallpierce opened this issue Dec 22, 2017 · 10 comments
Open

Clarify tracking of highest and lowest trackable values #75

marshallpierce opened this issue Dec 22, 2017 · 10 comments

Comments

@marshallpierce
Copy link
Collaborator

See #74 (comment) for more context.

We save in fields (and write when serializing) the requested limits, not the actual limits that result from the underlying encoding (which will encompass at least as much as what the user requested, and maybe more). Perhaps we should expose the actual limits of what a particular histogram can do, rather than just regurgitate the limits that the user requested? This would be useful when, say, storing metadata about histograms, since the data actually in the histogram is likely more interesting than the particular subset of values that were initially requested as trackable.

Strawman:

  • configured_low() for what the user requested when creating the histogram
  • actual_low() for what the histogram can support
  • configured_high(), actual_high()
@marshallpierce
Copy link
Collaborator Author

In particular for actual_low(), note that we don't want the value for index 0. That index would represent every value in [0, 1 << unit_magnitude), which isn't very useful since it doesn't obey the precision guarantees of the rest of the data structure. Instead, we want the value of index 1, which should be 1 << unit_magnitude.

@jonhoo
Copy link
Collaborator

jonhoo commented Dec 23, 2017

Are configured_low and configured_high even necessary?

@marshallpierce
Copy link
Collaborator Author

Nope, they aren't, but we have to keep track of those numbers for serialization anyway. We don't need to ship them in v1, so to speak, but we should probably leave room for them name-wise in case it turns out that something needs them. The use cases I could imagine are things like generic display tools for histograms that might want to list all the metadata or things like that.

@jonhoo
Copy link
Collaborator

jonhoo commented Dec 24, 2017

Hmm.. I'm partial to exposing the actual low and high as low and high, and then maybe exposing the configured values through a more verbose method.

@marshallpierce
Copy link
Collaborator Author

I have some reservations about that nomenclature.

  • We've already used those names for different concepts.
  • It's not very clear when reading foo.low() exactly what that means. A user who hasn't read the API docs (of the current version!) may be forgiven for thinking that the number returned is the min value recorded thus far, or the configured low.

I don't think those are deal breakers, so if your heart is set on low() and high(), I can live with it, but if we're breaking (at least semantically) backwards compatibility, I think we could just as well go with names that don't have those downsides.

@jonhoo
Copy link
Collaborator

jonhoo commented Dec 25, 2017

Well, we already have low, min, max, and high. I'm proposing that low and high be changed to return a more reasonable value than they currently do. One alternative would be to deprecate low and high, and instead introduce a range method which returns (actual_low, actual_high) + a configured_range which returns the configured low/high?

@marshallpierce
Copy link
Collaborator Author

Hm, I do kinda like returning something more structured -- what about perhaps even taking it one level further and exposing a struct that has all that sort of stuff in it (min, max, unit magnitude, etc)? That would help reduce the rather imposing number of methods on Histogram if we could funnel all that stuff through a helper type.

@jonhoo
Copy link
Collaborator

jonhoo commented Dec 27, 2017

Yeah, that's not a bad idea. Something like Histogram::statistics() -> HistogramStatistics. I guess we can bikeshed the name a little. I also like metrics and measure. Another question is whether or not we want to include measurements of the data (like max and min), and if so, where do we draw the line? Do we also want to start measuring (and report) the average? Number of samples?

@marshallpierce
Copy link
Collaborator Author

I think maybe we could divide the fields in Histogram up as follows:

  • Implementation details like leading_zero_count_base: semantically derived from other fields, but useful optimizations for fast path calculations.
  • Stuff that's dependent on the values stored, like max_value and total_count.
  • Stuff that's dependent only on initial config and not in the first group, like highest_trackable_value.

Assuming that doesn't end up seeming odd in practice, we could bundle up the last group and expose that to users.

@jonhoo
Copy link
Collaborator

jonhoo commented Nov 29, 2020

While going through my email, I came across some relevant notes from @giltene on the subject. The first is HdrHistogram/hdrhistogram-go#23 (comment). The second is from Gitter:

I like it, you’ve certainly made the lowest discernible changes clear. This is one of those situations where method overloading or optional parameters would come in handy, as allowing people to set lowestDiscernibleValue in the relatively rare cases where they need it to be non-1 creates a significant opportunity for confusion, which does not e use to the same level in languages where the commonly used histogram constructors don’t even use a lowest discernible argument.... my golang-foo is not strong enough to know what a good idiom to use here would be, if you wanted to encourage histogram creation that does not specify this parameter.

I think that a matching parameter name change to the one discussed here for golang would be appropriate for the C implementation, for the same user-confusion reasons. The difference between lowest discernible value and lowest traceable value may seem like a small semantic one, but I think there is a big difference in the likelihood of the api user misinterpreting the implications of setting the parameter to values other than 1.

Ditto for the Python implementation: I’d suggest changing lowest_trackable_value to lowest_discernible_value to cause people to think harder about the meaning and not jump to the assumption that it is simply the “min” of the range they want to cover.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants