
AMD integration #132
Open · wants to merge 5 commits into master
Conversation

@parthraut (Collaborator) commented Oct 17, 2024

  • Added amdsmi as a project dependency
  • Implemented a non-blocking constructor for AMDGPU using concurrent.futures.ThreadPoolExecutor, so the AMDGPUs constructor takes about 0.5 s (the polling time) regardless of the number of GPUs
  • Right now, the check waits 0.5 s and succeeds if the measured value is within 1% of the expected value; both numbers can be changed (see the sketch below)
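
For context, here is a minimal sketch of that check. The GPU method names come from the hunks quoted later in this review; the helper name, default values, and the unit comments (power in mW, energy in mJ) are assumptions, not the PR's exact code.

import time

def check_energy_counter(gpu, wait_time: float = 0.5, threshold: float = 0.01) -> bool:
    """Return True if the GPU's cumulative energy counter tracks its power draw."""
    power = gpu.getInstantPowerUsage()                # mW (assumed)
    initial_energy = gpu.getTotalEnergyConsumption()  # mJ (assumed)
    time.sleep(wait_time)
    final_energy = gpu.getTotalEnergyConsumption()    # mJ (assumed)

    measured_energy = final_energy - initial_energy
    expected_energy = power * wait_time               # mW * s == mJ
    # Succeed if the measured value is within `threshold` (1%) of the expected one.
    return abs(measured_energy - expected_energy) <= threshold * expected_energy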

A few questions:

  1. What do you think about the threshold (1%) and wait time (0.5s)?
  2. This is failing the pyright check:

     zeus/zeus/device/gpu/amd.py:247:13 - error: Operator "*" not supported for types "c_uint32" and "Literal[1000]" (reportOperatorIssue)
     zeus/zeus/device/gpu/amd.py:258:9 - error: Method "supportsGetTotalEnergyConsumption" overrides class "GPU" in an incompatible manner
         Positional parameter count mismatch; base method has 1, but override has 2 (reportIncompatibleMethodOverride)
     zeus/zeus/device/gpu/amd.py:280:26 - error: Cannot assign to attribute "_supportsGetTotalEnergyConsumption" for class "AMDGPU*"
         Attribute "_supportsGetTotalEnergyConsumption" is unknown (reportAttributeAccessIssue)
     zeus/zeus/device/gpu/amd.py:282:26 - error: Cannot assign to attribute "_supportsGetTotalEnergyConsumption" for class "AMDGPU*"
         Attribute "_supportsGetTotalEnergyConsumption" is unknown (reportAttributeAccessIssue)
     zeus/zeus/device/gpu/amd.py:293:26 - error: Cannot assign to attribute "_supportsGetTotalEnergyConsumption" for class "AMDGPU*"
         Attribute "_supportsGetTotalEnergyConsumption" is unknown (reportAttributeAccessIssue)
     zeus/zeus/device/gpu/amd.py:344:19 - error: Cannot access attribute "value" for class "AmdSmiException"
         Attribute "value" is unknown (reportAttributeAccessIssue)
     zeus/zeus/device/gpu/amd.py:346:37 - error: Cannot access attribute "msg" for class "AmdSmiException"

     I can fix these, but do you think this is the right approach? I wanted to make sure before tweaking the base class signatures (see the sketch below).
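
On the pyright errors: a common way to clear the reportAttributeAccessIssue group is to declare the attribute in __init__ so pyright knows it exists; a minimal sketch is below (the assumption that the base GPU constructor takes a GPU index is mine, not from the PR). The c_uint32 error can likely be cleared by converting the ctypes value to a plain int (for example via its .value attribute) before multiplying, assuming amdsmi really returns a ctypes object there.

class AMDGPU(GPU):
    def __init__(self, gpu_index: int) -> None:
        super().__init__(gpu_index)
        # Declaring the attribute here lets the later True/False assignments type-check.
        self._supportsGetTotalEnergyConsumption: bool = False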

@jaywonchung (Member) left a comment


Looked at this at a high level. Thanks for your work!

I think 0.5 seconds is just right, assuming it works.

Comment on lines +258 to +260
def supportsGetTotalEnergyConsumption(
    self, executor: concurrent.futures.ThreadPoolExecutor
) -> concurrent.futures.Future:

I think altering the signature breaks the abstract class contract for GPU and GPUs.
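
For illustration, a signature that keeps the contract might look like the sketch below, assuming the base class declares supportsGetTotalEnergyConsumption as a self-only method returning a bool (consistent with the parameter-count mismatch pyright reported):

def supportsGetTotalEnergyConsumption(self) -> bool:
    # Perform the blocking ~0.5 s measurement here (as in the hunk below)
    # and return True or False directly, instead of accepting an executor
    # and returning a Future; let the caller parallelize across GPUs.
    ...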

Comment on lines +268 to +276
power = self.getInstantPowerUsage()
initial_energy = self.getTotalEnergyConsumption()
time.sleep(wait_time)
final_energy = self.getTotalEnergyConsumption()

measured_energy = final_energy - initial_energy
expected_energy = (
    power * wait_time
)  # power is in mW, wait_time is in seconds

Expected energy here assumes that the GPU will maintain its activity level and power draw over the 0.5 second (or any non-trivial) measurement window. However, once this check and its 0.5 second wait start running asynchronously, deep learning scripts that use Zeus will perform their own initialization, like loading the model onto the GPU, which will change the power draw. Since the whole point of implementing this check asynchronously is to let such work overlap with the check, this is a logical contradiction and a race condition.

I advocate just making this check block for 0.5 seconds. Using a thread pool executor to parallelize the check across GPUs is a great idea, since then the check takes only 0.5 seconds regardless of the number of GPUs, and 0.5 seconds is not a big deal during initialization. A rough sketch of this approach follows.
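
A rough sketch of that suggestion at the level of the GPUs container (function and variable names here are illustrative; only ThreadPoolExecutor, the blocking per-GPU check, and the attribute name come from the discussion above):

from concurrent.futures import ThreadPoolExecutor

def run_energy_counter_checks(gpus: list) -> None:
    """Run the blocking ~0.5 s check on every GPU in parallel.

    Total wall-clock cost stays at roughly 0.5 s regardless of GPU count.
    This logic would live inside the AMDGPUs container, which owns the GPU objects.
    """
    with ThreadPoolExecutor(max_workers=max(len(gpus), 1)) as executor:
        # Each call blocks for ~0.5 s; the executor overlaps them across GPUs.
        results = list(
            executor.map(lambda gpu: gpu.supportsGetTotalEnergyConsumption(), gpus)
        )
    for gpu, supported in zip(gpus, results):
        gpu._supportsGetTotalEnergyConsumption = supported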

    self._supportsGetTotalEnergyConsumption = True
else:
    self._supportsGetTotalEnergyConsumption = False
    logger.warning(

Maybe WARNING is a bit too high. Energy measurement will continue to work via power polling. I think we should make it an INFO and the message should be a bit more detailed. With the current message, people will think energy measurement is not going to work at all on this GPU.
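
For example, the fallback message could look roughly like this (wording purely illustrative, reusing the logger from the hunk above):

logger.info(
    "Total energy counter is unavailable or inaccurate on this GPU. "
    "Energy measurement will still work, but Zeus will fall back to "
    "power polling instead of the hardware energy counter."
)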

elif "power" in energy_dict and "counter_resolution" in energy_dict: # Old API
energy = energy_dict["power"] * energy_dict["counter_resolution"]
else:
raise ValueError("Unexpected energy dictionary format")

Did we define a better exception class for this kind of error? If so, let's use it. Otherwise, let's not bother.
