This repository has been archived by the owner on May 19, 2021. It is now read-only.

Constant costs might decrease due to numerical instabilities #12

Open
StephanErb opened this issue Jul 16, 2017 · 6 comments

Comments

@StephanErb
Member

Exported data contains long floating-point numbers such as 0.0000032120027740023963. Due to the nature of floating-point arithmetic, aggregating those can lead to unstable results, as addition is not associative. This is problematic because it prevents us from defining meaningful aggregation rules over the exported cost data.

Please see prometheus/prometheus#2951 for details.

We don't need full precision here, so we should round the results to 2 or 3 digits before emitting.
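For illustration, floating-point addition is not associative, so the order in which values are summed can change the result; a minimal Python sketch (the values are artificial, chosen to make the effect obvious, not taken from real exports):

```python
# Floating-point addition is not associative: grouping changes the result.
vals = [1e16, 1.0, -1e16]

left_to_right = (vals[0] + vals[1]) + vals[2]  # the 1.0 is absorbed by 1e16
reordered = (vals[0] + vals[2]) + vals[1]      # cancellation happens first

print(left_to_right)  # 0.0
print(reordered)      # 1.0

# Rounding before emitting hides the low-order noise that causes flapping:
cost = 0.0000032120027740023963
print(round(cost, 3))  # 0.0
```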

@StephanErb
Member Author

Adopted workaround:

# FIXME: round is a workaround for https://github.com/blue-yonder/azure-cost-mon/issues/12
job:azure_costs_eur:sum =
    round(sum(azure_costs_eur))
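As a side note, PromQL's `round()` rounds to the nearest integer by default but also accepts an optional `to_nearest` scalar, so a variant of the rule could keep two decimal places instead (a sketch, not the rule actually deployed here):

```
job:azure_costs_eur:sum =
    round(sum(azure_costs_eur), 0.01)
```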

@ManuelBahr
Collaborator

The problem is even worse. Due to these instabilities, a counter can get a smaller value even though it should be constant. This results in a counter reset and completely invalidates the results of the increase or rate functions within Prometheus.
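To see why a tiny dip is so damaging: Prometheus treats any decrease of a counter as a reset and assumes the counter restarted from zero. A simplified Python sketch of that logic (not the actual Prometheus implementation):

```python
def increase(samples):
    """Simplified counter-increase logic: any decrease between adjacent
    samples is treated as a counter reset, i.e. the counter is assumed
    to have restarted at 0."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            total += cur          # reset: count the full new value
        else:
            total += cur - prev
    return total

# A constant cost of 100 EUR that briefly flaps down by a rounding error:
flapping = [100.0, 99.999999, 100.0]
print(increase(flapping))  # ~100.0 instead of the correct 0.0
```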

ManuelBahr added a commit that referenced this issue Jul 20, 2017
* fix for #12
* adapted changelog for the release
@ManuelBahr
Collaborator

Released v0.4.1 to fix the issue.

@ManuelBahr
Collaborator

Going to integers reduced the probability of the problem occurring, but we might still see flaps and counter resets as a result. Essentially, these are items that are constant in reality (resources that have been decommissioned), but the emitted value flaps around an integer boundary.
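The boundary effect is easy to demonstrate; the numbers below are made up, assuming the true cost sits almost exactly on an integer border:

```python
# The same underlying cost, exported twice with tiny server-side noise
# around an integer boundary (values are illustrative):
first_export = round(4.5000001)   # -> 5
second_export = round(4.4999999)  # -> 4: the counter appears to decrease

print(first_export, second_export)
# Prometheus would interpret the 5 -> 4 drop as a counter reset.
```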

@StephanErb
Member Author

We are using the following aggregation rules as a workaround right now:

#
# Azure
#
# We ignore the Prometheus rule naming convention here by not specifying the labels of the series
# before the first colon. We simply don't know which ones are there. In any case, we still need
# the recording rule, as computing increase/changes over 2 days is costly for plots.
#

# Cost increase over the last 2 days
# We cannot use the normal increase function here as the Azure API is providing slightly
# fluctuating costs. Those would be interpreted as counter resets, leading to wrong results.
azure_costs_eur:increase2d =
  (azure_costs_eur - azure_costs_eur offset 2d)

# Number of updates from the Azure API over the last 2 days. The Azure API is providing changes once
# a day but not at the same time. So we expect this value to be either 1 or 2.
azure_costs_eur:changes2d =
  changes(azure_costs_eur[2d])

# This metric shows our total daily costs. Due to the slow moving counters provided by the Azure API,
# the value is computed as the average over the last 2 days. In Prometheus speak, we emit
# the average observation size over a 2 day time period. As we only have ~1 change per day
# this is our daily costs.
job:azure_costs_eur:mean2d =
  sum(
      (azure_costs_eur:increase2d > 0)
    /
      (azure_costs_eur:changes2d  > 0) # We need the > 0 filter to prevent the propagation of NaN.
  )
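The intent of these rules can be mirrored in plain Python; `daily_cost` below is a hypothetical helper for illustration, not part of azure-cost-mon:

```python
def daily_cost(now, two_days_ago, num_changes):
    """Hypothetical mirror of the recording rules above: the raw cost
    difference over the 2-day window divided by the number of API
    updates seen in it (~1 per day), skipping series where either
    > 0 filter fails."""
    increase2d = now - two_days_ago
    if increase2d <= 0 or num_changes <= 0:
        return None  # filtered out, prevents NaN propagation
    return increase2d / num_changes

# 10 EUR accumulated over 2 days, reported in 2 API updates -> 5 EUR/day
print(daily_cost(210.0, 200.0, 2))  # 5.0
```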

@ManuelBahr
Collaborator

I think I ruled out that the non-associativity of floats is the problem. I could track the flaps down to occurring only during updates of the API. Also, I wasn't able to reproduce the flapping with Python floats. So I suspect the issue is on the server side, not in the code that aggregates. However, rounding or truncating in some way might still fix it.

@ManuelBahr ManuelBahr changed the title Long floating point numbers cause accuracy issues in Prometheus Constant costs might decrease due to numerical instabilities Sep 14, 2017