Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose stats #279

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Expose stats #279

wants to merge 1 commit into from

Conversation

neob91-close
Copy link
Contributor

This allows hooking into the stats by defining a STATS_CALLBACK which will receive task runtimes.
This is useful for measuring the utilisation of workers.

@neob91-close neob91-close marked this pull request as ready for review April 20, 2023 08:24
@neob91-close neob91-close requested a review from nsaje April 20, 2023 08:24
Copy link
Member

@thomasst thomasst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would like to know some more details on how this will be used exactly before approving.

def __init__(self, stats: Stats) -> None:
super().__init__()

self.tiger = stats.tiger
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this ideally just say self._stats_interval = stats.tiger.config["STATS_INTERVAL"] rather than looking it up every time?

# For example, the worker's utilisation over the last 30 minutes
# can be obtained by dividing the sum of task durations reported
# over the last 30 minutes by 30 minutes.
"STATS_CALLBACK": None,
Copy link
Member

@thomasst thomasst Apr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the point of the stats thread is to print out metrics periodically, I would have also expected this to be called periodically rather than at the end of the task, and report the same metrics that we log (time total, time busy, utilization). Otherwise this could be solved via something like CHILD_CONTEXT_MANAGERS, although that runs in the child process. Should this be called TASK_END_STATS_CALLBACK then?

the worker's utilisation over the last 30 minutes can be obtained by dividing the sum of task durations reported over the last 30 minutes by 30 minutes.

This is not actually correct. Example: If a task runs from 0m to 5m, and another task runs from 29m59s-34m59s, and the second task's stats callback looks at any reports in the past 30 minutes (i.e. 5m - 34m59s) , it would show 10m in task durations over 30m (~33% utilization), rather than the actual 5m1s (~17% utilization).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should definitely have both start and end callbacks to measure and report utilization accurately while the task is still running. With our configuration, we often collect metrics on sub-minute intervals, and a single task can run for more than a few minutes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the point of the stats thread is to print out metrics periodically, I would have also expected this to be called periodically rather than at the end of the task, and report the same metrics that we log (time total, time busy, utilization).

Yep, that sounds better than what I did.

This is not actually correct. Example: If a task runs from 0m to 5m, and another task runs from 29m59s-34m59s, and the second task's stats callback looks at any reports in the past 30 minutes (i.e. 5m - 34m59s) , it would show 10m in task durations over 30m (~33% utilization), rather than the actual 5m1s (~17% utilization).

Good point.

We should definitely have both start and end callbacks to measure and report utilization accurately while the task is still running. With our configuration, we often collect metrics on sub-minute intervals, and a single task can run for more than a few minutes.

Couldn't we achieve it with simply moving the callback to the log method to be called at an interval?
I don't think we need to get notified of the start and stop of every task.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get notified of the start and stop of every task.

Having that interface allows for reporting much more than what we do now. We could report utilization grouped per some other metric, for example task duration bucket, task name, or success/failure condition.

The interval of metric collection should be owned by the metric collection setup, not by tasktiger. Tasktiger does provide a simple log-only "monitoring" of periodic prints but in production we could use much more sophisticated stuff like opentelemetry. Reporting metrics on an interval is also not great if your metric collection is also pull-based periodic and the periods don't align well. For example, there was a task running between 1s and 16s. At 30s Tasktiger reports utilization of 50% (15s busy out of last 30s) and then does nothing for then next minute. At 59s (just before next log call at 60s) monitoring pulls the last reported value of 50%. The monitoring shows utilization of 50% at 59s while in reality it was 0%. The metric is lagging and can be very confusing when debugging performance issues. In contrast, with start/end notifications we can compute utilization metric right at the collection time - or better yet, report the up-to-date (not lagging) totals of busy/idle times and let the monitoring compute utilization.

@vtclose vtclose removed the request for review from tsx May 11, 2023 13:02
@nsaje nsaje removed their request for review August 31, 2023 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants