Switch to Python-based power/energy monitor
jaywonchung committed Aug 23, 2023
1 parent e610849 commit 2dbd5a6
Showing 14 changed files with 634 additions and 199 deletions.
31 changes: 30 additions & 1 deletion README.md
@@ -16,7 +16,7 @@
---
**Project News**

- \[2023/07\] [`ZeusMonitor`](https://ml.energy/zeus/reference/monitor/#zeus.monitor.ZeusMonitor) was used to profile GPU time and energy consumption for the [ML.ENERGY leaderboard](https://ml.energy/leaderboard).
- \[2023/07\] [`ZeusMonitor`](https://ml.energy/zeus/reference/monitor/#zeus.monitor.ZeusMonitor) was used to profile GPU time and energy consumption for the [ML.ENERGY leaderboard & Colosseum](https://ml.energy/leaderboard).
- \[2023/03\] [Chase](https://symbioticlab.org/publications/files/chase:ccai23/chase-ccai23.pdf), an automatic carbon optimization framework for DNN training, will appear at ICLR'23 workshop.
- \[2022/11\] [Carbon-Aware Zeus](https://taikai.network/gsf/hackathons/carbonhack22/projects/cl95qxjpa70555701uhg96r0ek6/idea) won the **second overall best solution award** at Carbon Hack 22.
---
@@ -59,6 +59,35 @@ for x, y in train_dataloader:
plo.on_epoch_end()
```

### CLI power and energy monitor

```console
$ python -m zeus.monitor power
[2023-08-22 22:39:59,787] [PowerMonitor](power.py:134) Monitoring power usage of GPUs [0, 1, 2, 3]
2023-08-22 22:40:00.800576
{'GPU0': 66.176, 'GPU1': 68.792, 'GPU2': 66.898, 'GPU3': 67.53}
2023-08-22 22:40:01.842590
{'GPU0': 66.078, 'GPU1': 68.595, 'GPU2': 66.996, 'GPU3': 67.138}
2023-08-22 22:40:02.845734
{'GPU0': 66.078, 'GPU1': 68.693, 'GPU2': 66.898, 'GPU3': 67.236}
2023-08-22 22:40:03.848818
{'GPU0': 66.177, 'GPU1': 68.675, 'GPU2': 67.094, 'GPU3': 66.926}
^C
Total time (s): 4.421529293060303
Total energy (J):
{'GPU0': 198.52566362297537, 'GPU1': 206.22215216255188, 'GPU2': 201.08565518283845, 'GPU3': 201.79834523367884}
```
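
The total energy printed above is consistent with integrating the sampled power over time. As a rough illustration only (this is not necessarily how `PowerMonitor` computes its total), the sketch below applies the trapezoidal rule to the four `GPU0` samples shown in the console output. The function name `integrate_energy` is hypothetical.

```python
from datetime import datetime

# (timestamp, watts) pairs for GPU0, copied from the console output above.
samples = [
    ("2023-08-22 22:40:00.800576", 66.176),
    ("2023-08-22 22:40:01.842590", 66.078),
    ("2023-08-22 22:40:02.845734", 66.078),
    ("2023-08-22 22:40:03.848818", 66.177),
]


def integrate_energy(samples):
    """Trapezoidal integration of power (W) over time (s), returning joules.

    Hypothetical helper for illustration; not part of the zeus package.
    """
    points = [(datetime.fromisoformat(ts), watts) for ts, watts in samples]
    total = 0.0
    for (t0, p0), (t1, p1) in zip(points, points[1:]):
        dt = (t1 - t0).total_seconds()
        total += 0.5 * (p0 + p1) * dt
    return total


energy_gpu0 = integrate_energy(samples)
print(f"GPU0 energy over the sampled interval: {energy_gpu0:.1f} J")
```

For these four samples the sketch gives roughly 201.5 J, close to but not identical to the 198.5 J the tool reports; the tool's own sampling window and integration method may differ.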

```console
$ python -m zeus.monitor energy
[2023-08-22 22:44:45,106] [ZeusMonitor](energy.py:157) Monitoring GPU [0, 1, 2, 3].
[2023-08-22 22:44:46,210] [zeus.util.framework](framework.py:38) PyTorch with CUDA support is available.
[2023-08-22 22:44:46,760] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' started.
^C[2023-08-22 22:44:50,205] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' ended.
Total energy (J):
Measurement(time=3.4480526447296143, energy={0: 224.2969999909401, 1: 232.83799999952316, 2: 233.3100000023842, 3: 234.53700000047684})
```

Please refer to our NSDI’23 [paper](https://www.usenix.org/conference/nsdi23/presentation/you) and [slides](https://www.usenix.org/system/files/nsdi23_slides_chung.pdf) for details.
Check out [Overview](https://ml.energy/zeus/overview/) for a summary.

6 changes: 5 additions & 1 deletion docs/getting_started/installing_and_building.md
@@ -1,6 +1,6 @@
# Installing and Building Zeus Components

This document explains how to install the [`zeus`][zeus] Python package and how to build the [Zeus power monitor](https://github.com/SymbioticLab/Zeus/tree/master/zeus_monitor).
This document explains how to install the [`zeus`][zeus] Python package.

!!! Tip
    We encourage users to utilize our Docker image. Please refer to [Environment setup](./environment.md). Quick command:
@@ -49,6 +49,10 @@ pip install -e .

## Zeus power monitor

!!! Warning
    The C++ Zeus power monitor is now deprecated as we've switched to a Python-based power monitor.
    See [`PowerMonitor`][zeus.monitor.power.PowerMonitor] or run `python -m zeus.monitor --help`.

### Dependencies

All dependencies are pre-installed if you're using our Docker image.
31 changes: 30 additions & 1 deletion docs/index.md
@@ -18,7 +18,7 @@ hide:
---
**Project News**

- \[2023/07\] [`ZeusMonitor`][zeus.monitor.ZeusMonitor] was used to profile GPU time and energy consumption for the [ML.ENERGY leaderboard](https://ml.energy/leaderboard).
- \[2023/07\] [`ZeusMonitor`][zeus.monitor.ZeusMonitor] was used to profile GPU time and energy consumption for the [ML.ENERGY leaderboard & Colosseum](https://ml.energy/leaderboard).
- \[2023/03\] [Chase](https://symbioticlab.org/publications/files/chase:ccai23/chase-ccai23.pdf), an automatic carbon optimization framework for DNN training, will appear at ICLR'23 workshop.
- \[2022/11\] [Carbon-Aware Zeus](https://taikai.network/gsf/hackathons/carbonhack22/projects/cl95qxjpa70555701uhg96r0ek6/idea) won the **second overall best solution award** at Carbon Hack 22.
---
@@ -62,6 +62,35 @@ for x, y in train_dataloader:
plo.on_epoch_end()
```

### CLI power and energy monitor

```console
$ python -m zeus.monitor power
[2023-08-22 22:39:59,787] [PowerMonitor](power.py:134) Monitoring power usage of GPUs [0, 1, 2, 3]
2023-08-22 22:40:00.800576
{'GPU0': 66.176, 'GPU1': 68.792, 'GPU2': 66.898, 'GPU3': 67.53}
2023-08-22 22:40:01.842590
{'GPU0': 66.078, 'GPU1': 68.595, 'GPU2': 66.996, 'GPU3': 67.138}
2023-08-22 22:40:02.845734
{'GPU0': 66.078, 'GPU1': 68.693, 'GPU2': 66.898, 'GPU3': 67.236}
2023-08-22 22:40:03.848818
{'GPU0': 66.177, 'GPU1': 68.675, 'GPU2': 67.094, 'GPU3': 66.926}
^C
Total time (s): 4.421529293060303
Total energy (J):
{'GPU0': 198.52566362297537, 'GPU1': 206.22215216255188, 'GPU2': 201.08565518283845, 'GPU3': 201.79834523367884}
```

```console
$ python -m zeus.monitor energy
[2023-08-22 22:44:45,106] [ZeusMonitor](energy.py:157) Monitoring GPU [0, 1, 2, 3].
[2023-08-22 22:44:46,210] [zeus.util.framework](framework.py:38) PyTorch with CUDA support is available.
[2023-08-22 22:44:46,760] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' started.
^C[2023-08-22 22:44:50,205] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' ended.
Total energy (J):
Measurement(time=3.4480526447296143, energy={0: 224.2969999909401, 1: 232.83799999952316, 2: 233.3100000023842, 3: 234.53700000047684})
```
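
The `energy` subcommand wraps a single measurement window. The same window pattern (`begin_window` / `end_window` with a name, returning a measurement with `time` and per-GPU `energy`) can be sketched without a GPU as below. `FakeMonitor` and `FakeMeasurement` are hypothetical stand-ins written for this illustration, not the real `ZeusMonitor`; the zero energy values are placeholders.

```python
from dataclasses import dataclass
from time import monotonic


@dataclass
class FakeMeasurement:
    """Hypothetical result type mirroring the `time` and `energy` fields shown above."""

    time: float
    energy: dict


class FakeMonitor:
    """Hypothetical stand-in exposing named measurement windows.

    It records only elapsed time; energy values are placeholders (0.0 J).
    """

    def __init__(self, gpu_indices):
        self.gpu_indices = gpu_indices
        self._windows = {}

    def begin_window(self, name, sync_cuda=True):
        if name in self._windows:
            raise ValueError(f"Measurement window '{name}' already exists")
        self._windows[name] = monotonic()

    def end_window(self, name, sync_cuda=True):
        if name not in self._windows:
            raise ValueError(f"Measurement window '{name}' does not exist")
        elapsed = monotonic() - self._windows.pop(name)
        return FakeMeasurement(time=elapsed, energy={i: 0.0 for i in self.gpu_indices})


monitor = FakeMonitor(gpu_indices=[0, 1])

# Windows are identified by name, so they can be nested or overlap.
monitor.begin_window("epoch", sync_cuda=False)
monitor.begin_window("step", sync_cuda=False)
step = monitor.end_window("step", sync_cuda=False)
epoch = monitor.end_window("epoch", sync_cuda=False)

print(step)
print(epoch)
```

With the real monitor, the same call sequence would be issued on a `ZeusMonitor` instance, assuming a machine with supported GPUs.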

Please refer to our NSDI’23 [paper](https://www.usenix.org/conference/nsdi23/presentation/you) and [slides](https://www.usenix.org/system/files/nsdi23_slides_chung.pdf) for details.
Check out [Overview](overview/index.md) for a summary.

2 changes: 1 addition & 1 deletion examples/ZeusDataLoader/README.md
@@ -1,4 +1,4 @@
# Examples for `ZeusDataLoader`
# [Deprecated] Examples using `ZeusDataLoader`

The `ZeusDataLoader` is on its way to deprecation, as we attempt to transition to new constructs including `ZeusMonitor`, `GlobalPowerLimitOptimizer`, and `EarlyStopController`.
We'll keep these old examples around while `ZeusDataLoader` still exists.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -9,12 +9,12 @@ select = [
"B", # flake8-bugbear (detects likely bugs)
"G", # flake8-logging-format (complains about logging)
"SIM", # flake8-simplify (suggests code simplifications)
"RUF", # Ruff-introduced misc rules
]
ignore = [
"PLW0603", # Global statement
"B019", # Usage of functools.lru_cache
"PLR0913", # Too many function arguments
"PLR0912", # Too many branches
"B905", # zip strict argument
"PLR0915", # Too many statements
"PLR2004", # Magic values
1 change: 1 addition & 0 deletions setup.py
@@ -50,6 +50,7 @@
"scikit-learn",
"nvidia-ml-py",
"pydantic",
"rich",
],
python_requires=">=3.8",
extras_require=extras_require,
100 changes: 38 additions & 62 deletions tests/test_monitor.py
@@ -41,13 +41,15 @@
@pytest.fixture
def pynvml_mock(mocker: MockerFixture):
"""Mock the entire pynvml module."""
mock = mocker.patch("zeus.monitor.pynvml", autospec=True)
mock = mocker.patch("zeus.monitor.energy.pynvml", autospec=True)

# Except for the arch constants.
mock.NVML_DEVICE_ARCH_PASCAL = pynvml.NVML_DEVICE_ARCH_PASCAL
mock.NVML_DEVICE_ARCH_VOLTA = pynvml.NVML_DEVICE_ARCH_VOLTA
mock.NVML_DEVICE_ARCH_AMPERE = pynvml.NVML_DEVICE_ARCH_AMPERE

mocker.patch("zeus.util.env.pynvml", mock)

return mock


@@ -151,22 +153,27 @@ def test_monitor(pynvml_mock, mock_gpus, mocker: MockerFixture, tmp_path: Path):
num_gpus = len(gpu_archs)
is_old_nvml = {index: arch < pynvml.NVML_DEVICE_ARCH_VOLTA for index, arch in zip(nvml_gpu_indices, gpu_archs)}
is_old_torch = {index: arch < pynvml.NVML_DEVICE_ARCH_VOLTA for index, arch in zip(torch_gpu_indices, gpu_archs)}
num_old_archs = sum(is_old_nvml.values())
old_gpu_torch_indices = [index for index, is_old in is_old_torch.items() if is_old]

mocker.patch("zeus.monitor.energy.atexit.register")

mkdtemp_mock = mocker.patch("zeus.monitor.tempfile.mkdtemp", return_value="mock_log_dir")
which_mock = mocker.patch("zeus.monitor.shutil.which", return_value="zeus_monitor")
popen_mock = mocker.patch("zeus.monitor.subprocess.Popen", autospec=True)
mocker.patch("zeus.monitor.atexit.register")
class MockPowerMonitor:
    def __init__(self, gpu_indices: list[int] | None, update_period: float | None) -> None:
        assert gpu_indices == old_gpu_torch_indices
        self.gpu_indices = gpu_indices
        self.update_period = update_period
    def get_energy(self, start: float, end: float) -> dict[int, float]:
        return {i: -1.0 for i in self.gpu_indices}
mocker.patch("zeus.monitor.energy.PowerMonitor", MockPowerMonitor)

monotonic_counter = itertools.count(start=4, step=1)
mocker.patch("zeus.monitor.time.monotonic", side_effect=monotonic_counter)
time_counter = itertools.count(start=4, step=1)
mocker.patch("zeus.monitor.energy.time", side_effect=time_counter)

energy_counters = {
f"handle{i}": itertools.count(start=1000, step=3)
for i in nvml_gpu_indices if not is_old_nvml[i]
}
pynvml_mock.nvmlDeviceGetTotalEnergyConsumption.side_effect = lambda handle: next(energy_counters[handle])
energy_mock = mocker.patch("zeus.monitor.analyze.energy")

log_file = tmp_path / "log.csv"

@@ -175,32 +182,6 @@ def test_monitor(pynvml_mock, mock_gpus, mocker: MockerFixture, tmp_path: Path):
########################################
monitor = ZeusMonitor(gpu_indices=gpu_indices, log_file=log_file)

if num_old_archs > 0:
assert mkdtemp_mock.call_count == 1
assert which_mock.call_count == 1
else:
assert mkdtemp_mock.call_count == 0
assert which_mock.call_count == 0

# Zeus monitors should only have been spawned for GPUs with old architectures using NVML indices.
assert popen_mock.call_count == num_old_archs
calls = []
for nvml_gpu_index, torch_gpu_index in zip(nvml_gpu_indices, torch_gpu_indices):
if is_old_nvml[nvml_gpu_index]:
calls.append(call([
"zeus_monitor",
monitor._monitor_log_path(torch_gpu_index),
"0",
"100",
str(nvml_gpu_index),
]))
if calls:
popen_mock.assert_has_calls(calls)
assert list(monitor.monitors.keys()) == [i for i in torch_gpu_indices if is_old_torch[i]]

# Start time would be 4, as specified in the counter constructor.
assert monitor.monitor_start_time == 4

# Check GPU index parsing from the log file.
replay_monitor = ReplayZeusMonitor(gpu_indices=None, log_file=log_file)
assert replay_monitor.gpu_indices == list(torch_gpu_indices)
@@ -210,17 +191,16 @@ def test_monitor(pynvml_mock, mock_gpus, mocker: MockerFixture, tmp_path: Path):
########################################
def tick():
"""Calling this function will simulate a tick of time passing."""
next(monotonic_counter)
next(time_counter)
for counter in energy_counters.values():
next(counter)

def assert_window_begin(name: str, begin_time: int):
"""Assert monitor measurement states right after a window begins."""
assert monitor.measurement_states[name][0] == begin_time
assert monitor.measurement_states[name][1] == {
# `begin_time` is actually one tick ahead from the perspective of the
# energy counters, so we subtract 5 instead of 4.
i: pytest.approx((1000 + 3 * (begin_time - 5)) / 1000.0)
# `4` is the time origin of `time_counter`.
i: pytest.approx((1000 + 3 * (begin_time - 4)) / 1000.0)
for i in torch_gpu_indices if not is_old_torch[i]
}
pynvml_mock.nvmlDeviceGetTotalEnergyConsumption.assert_has_calls([
@@ -247,6 +227,7 @@ def assert_measurement(
assert name not in monitor.measurement_states
assert num_gpus == len(measurement.energy)
assert elapsed_time == measurement.time
assert set(measurement.energy.keys()) == set(torch_gpu_indices)
for i in torch_gpu_indices:
if not is_old_torch[i]:
# The energy counter increments with step size 3.
@@ -255,11 +236,6 @@ def assert_measurement(
if not assert_calls:
return

energy_mock.assert_has_calls([
call(f"mock_log_dir/gpu{i}.power.csv", begin_time - 4, begin_time + elapsed_time - 4)
for i in torch_gpu_indices if is_old_torch[i]
])
energy_mock.reset_mock()
pynvml_mock.nvmlDeviceGetTotalEnergyConsumption.assert_has_calls([
call(f"handle{i}") for i in nvml_gpu_indices if not is_old_nvml[i]
])
@@ -268,7 +244,7 @@ def assert_measurement(

# Serial non-overlapping windows.
monitor.begin_window("window1", sync_cuda=False)
assert_window_begin("window1", 5)
assert_window_begin("window1", 4)

tick()

@@ -277,17 +253,17 @@ def assert_measurement(
monitor.begin_window("window1", sync_cuda=False)

measurement = monitor.end_window("window1", sync_cuda=False)
assert_measurement("window1", measurement, begin_time=5, elapsed_time=2)
assert_measurement("window1", measurement, begin_time=4, elapsed_time=2)

tick(); tick()

monitor.begin_window("window2", sync_cuda=False)
assert_window_begin("window2", 10)
assert_window_begin("window2", 9)

tick(); tick(); tick()

measurement = monitor.end_window("window2", sync_cuda=False)
assert_measurement("window2", measurement, begin_time=10, elapsed_time=4)
assert_measurement("window2", measurement, begin_time=9, elapsed_time=4)

# Calling `end_window` again with the same name should raise an error.
with pytest.raises(ValueError, match="does not exist"):
@@ -299,40 +275,40 @@ def assert_measurement(

# Overlapping windows.
monitor.begin_window("window3", sync_cuda=False)
assert_window_begin("window3", 15)
assert_window_begin("window3", 14)

tick()

monitor.begin_window("window4", sync_cuda=False)
assert_window_begin("window4", 17)
assert_window_begin("window4", 16)

tick(); tick();

measurement = monitor.end_window("window3", sync_cuda=False)
assert_measurement("window3", measurement, begin_time=15, elapsed_time=5)
assert_measurement("window3", measurement, begin_time=14, elapsed_time=5)

tick(); tick(); tick();

measurement = monitor.end_window("window4", sync_cuda=False)
assert_measurement("window4", measurement, begin_time=17, elapsed_time=7)
assert_measurement("window4", measurement, begin_time=16, elapsed_time=7)


# Nested windows.
monitor.begin_window("window5", sync_cuda=False)
assert_window_begin("window5", 25)
assert_window_begin("window5", 24)

monitor.begin_window("window6", sync_cuda=False)
assert_window_begin("window6", 26)
assert_window_begin("window6", 25)

tick(); tick();

measurement = monitor.end_window("window6", sync_cuda=False)
assert_measurement("window6", measurement, begin_time=26, elapsed_time=3)
assert_measurement("window6", measurement, begin_time=25, elapsed_time=3)

tick(); tick(); tick();

measurement = monitor.end_window("window5", sync_cuda=False)
assert_measurement("window5", measurement, begin_time=25, elapsed_time=8)
assert_measurement("window5", measurement, begin_time=24, elapsed_time=8)

########################################
# Test content of `log_file`.
@@ -363,12 +339,12 @@ def assert_log_file_row(row: str, name: str, begin_time: int, elapsed_time: int)
if not is_old_torch[gpu_index]:
assert float(pieces[3 + i]) == pytest.approx(elapsed_time * 3 / 1000.0)

assert_log_file_row(lines[1], "window1", 5, 2)
assert_log_file_row(lines[2], "window2", 10, 4)
assert_log_file_row(lines[3], "window3", 15, 5)
assert_log_file_row(lines[4], "window4", 17, 7)
assert_log_file_row(lines[5], "window6", 26, 3)
assert_log_file_row(lines[6], "window5", 25, 8)
assert_log_file_row(lines[1], "window1", 4, 2)
assert_log_file_row(lines[2], "window2", 9, 4)
assert_log_file_row(lines[3], "window3", 14, 5)
assert_log_file_row(lines[4], "window4", 16, 7)
assert_log_file_row(lines[5], "window6", 25, 3)
assert_log_file_row(lines[6], "window5", 24, 8)

########################################
# Test replaying from the log file.
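
The test above drives time and energy from `itertools.count` iterators so that every expected value is deterministic. A minimal, self-contained sketch of that technique follows, using the standard library's `unittest.mock` instead of `pytest-mock`. The helpers `read_energy_mj` and `measure` are hypothetical and exist only for this illustration; the millijoule-to-joule division mirrors the `/ 1000.0` in the assertions above.

```python
import itertools
import time
from unittest import mock

# Fake clock starts at 4 and advances by 1; fake energy counter starts at
# 1000 mJ and advances by 3, matching the origins used in the test above.
time_counter = itertools.count(start=4, step=1)
energy_counter = itertools.count(start=1000, step=3)


def read_energy_mj():
    """Hypothetical stand-in for a cumulative energy counter read (millijoules)."""
    return next(energy_counter)


def tick():
    """Simulate one unit of time passing on both counters."""
    next(time_counter)
    next(energy_counter)


def measure(work):
    """Hypothetical window: returns (elapsed seconds, energy in joules)."""
    start_time, start_energy = time.time(), read_energy_mj()
    work()
    elapsed = time.time() - start_time
    energy_j = (read_energy_mj() - start_energy) / 1000.0
    return elapsed, energy_j


# An iterator passed as side_effect makes each call return its next value.
with mock.patch("time.time", side_effect=time_counter):
    elapsed, energy_j = measure(tick)

print(elapsed, energy_j)
```

Because `tick()` advances the same iterators that the mocked calls consume, one tick between the begin and end reads yields an elapsed time of 2 and an energy of `2 * 3 / 1000.0` J, the same relationship `assert_log_file_row` checks.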
23 changes: 23 additions & 0 deletions zeus/monitor/__init__.py
@@ -0,0 +1,23 @@
# Copyright (C) 2023 Jae-Won Chung <jwnchung@umich.edu>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Time, energy, and power monitors for Zeus.
The main class of this module is [`ZeusMonitor`][zeus.monitor.energy.ZeusMonitor].
If users wish to monitor power consumption over time, the [`power`][zeus.monitor.power]
module can come in handy.
"""

from zeus.monitor.energy import ZeusMonitor, Measurement