
AMD integration #132
Open · wants to merge 5 commits into master
Conversation

@parthraut (Collaborator) commented Oct 17, 2024

  • Added amdsmi as a project dependency
  • Implemented a non-blocking constructor for AMDGPU using concurrent.futures.ThreadPoolExecutor, so the AMDGPUs constructor takes about 0.5 s (the polling time) regardless of the number of GPUs
  • Right now, the check waits 0.5 s and succeeds if the measured value is within 1% of the expected value; both numbers can be changed (see the sketch below)
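
For context, here is a minimal sketch of that check. The GPU method names come from the hunks quoted later in this review; the helper name, default values, and the unit comments (power in mW, energy in mJ) are assumptions, not the PR's exact code.

import time

def check_energy_counter(gpu, wait_time: float = 0.5, threshold: float = 0.01) -> bool:
    """Return True if the GPU's cumulative energy counter tracks its power draw."""
    power = gpu.getInstantPowerUsage()                # mW (assumed)
    initial_energy = gpu.getTotalEnergyConsumption()  # mJ (assumed)
    time.sleep(wait_time)
    final_energy = gpu.getTotalEnergyConsumption()    # mJ (assumed)

    measured_energy = final_energy - initial_energy
    expected_energy = power * wait_time               # mW * s == mJ
    # Succeed if the measured value is within `threshold` (1%) of the expected one.
    return abs(measured_energy - expected_energy) <= threshold * expected_energy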

A few questions:

  1. What do you think about the threshold (1%) and wait time (0.5s)?
  2. This is failing the pyright check:

     zeus/zeus/device/gpu/amd.py:247:13 - error: Operator "*" not supported for types "c_uint32" and "Literal[1000]" (reportOperatorIssue)
     zeus/zeus/device/gpu/amd.py:258:9 - error: Method "supportsGetTotalEnergyConsumption" overrides class "GPU" in an incompatible manner
         Positional parameter count mismatch; base method has 1, but override has 2 (reportIncompatibleMethodOverride)
     zeus/zeus/device/gpu/amd.py:280:26 - error: Cannot assign to attribute "_supportsGetTotalEnergyConsumption" for class "AMDGPU*"
         Attribute "_supportsGetTotalEnergyConsumption" is unknown (reportAttributeAccessIssue)
     zeus/zeus/device/gpu/amd.py:282:26 - error: Cannot assign to attribute "_supportsGetTotalEnergyConsumption" for class "AMDGPU*"
         Attribute "_supportsGetTotalEnergyConsumption" is unknown (reportAttributeAccessIssue)
     zeus/zeus/device/gpu/amd.py:293:26 - error: Cannot assign to attribute "_supportsGetTotalEnergyConsumption" for class "AMDGPU*"
         Attribute "_supportsGetTotalEnergyConsumption" is unknown (reportAttributeAccessIssue)
     zeus/zeus/device/gpu/amd.py:344:19 - error: Cannot access attribute "value" for class "AmdSmiException"
         Attribute "value" is unknown (reportAttributeAccessIssue)
     zeus/zeus/device/gpu/amd.py:346:37 - error: Cannot access attribute "msg" for class "AmdSmiException"

     I can fix these, but do you think this is the right approach? I wanted to make sure before tweaking the base class signatures (see the sketch below).
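
On the pyright errors: a common way to clear the reportAttributeAccessIssue group is to declare the attribute in __init__ so pyright knows it exists; a minimal sketch is below (the assumption that the base GPU constructor takes a GPU index is mine, not from the PR). The c_uint32 error can likely be cleared by converting the ctypes value to a plain int (for example via its .value attribute) before multiplying, assuming amdsmi really returns a ctypes object there.

class AMDGPU(GPU):
    def __init__(self, gpu_index: int) -> None:
        super().__init__(gpu_index)
        # Declaring the attribute here lets the later True/False assignments type-check.
        self._supportsGetTotalEnergyConsumption: bool = False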

@jaywonchung (Member) left a comment


Looked at this at a high level. Thanks for your work!

I think 0.5 seconds is just right, assuming it works.

Comment on lines +258 to +260
def supportsGetTotalEnergyConsumption(
    self, executor: concurrent.futures.ThreadPoolExecutor
) -> concurrent.futures.Future:

I think altering the signature breaks the abstract class contract for GPU and GPUs.
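
For illustration, a signature that keeps the contract might look like the sketch below, assuming the base class declares supportsGetTotalEnergyConsumption as a self-only method returning a bool (consistent with the parameter-count mismatch pyright reported):

def supportsGetTotalEnergyConsumption(self) -> bool:
    # Perform the blocking ~0.5 s measurement here (as in the hunk below)
    # and return True or False directly, instead of accepting an executor
    # and returning a Future; let the caller parallelize across GPUs.
    ...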

Comment on lines +268 to +276
power = self.getInstantPowerUsage()
initial_energy = self.getTotalEnergyConsumption()
time.sleep(wait_time)
final_energy = self.getTotalEnergyConsumption()

measured_energy = final_energy - initial_energy
expected_energy = (
    power * wait_time
)  # power is in mW, wait_time is in seconds

Expected energy here assumes that the GPU will maintain its activity level and power draw over the 0.5 second (or any non-trivial) measurement window. However, once this check and its 0.5 second wait start running asynchronously, deep learning scripts that use Zeus will perform their own initialization, like loading the model onto the GPU, which will change the power draw. Since the whole point of implementing this check asynchronously is to let such work overlap with the check, this is a logical contradiction and a race condition.

I advocate just making this check block for 0.5 seconds. Using a thread pool executor to parallelize the check across GPUs is a great idea, since then the check takes only 0.5 seconds regardless of the number of GPUs, and 0.5 seconds is not a big deal during initialization. A rough sketch of this approach follows.
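
A rough sketch of that suggestion at the level of the GPUs container (function and variable names here are illustrative; only ThreadPoolExecutor, the blocking per-GPU check, and the attribute name come from the discussion above):

from concurrent.futures import ThreadPoolExecutor

def run_energy_counter_checks(gpus: list) -> None:
    """Run the blocking ~0.5 s check on every GPU in parallel.

    Total wall-clock cost stays at roughly 0.5 s regardless of GPU count.
    This logic would live inside the AMDGPUs container, which owns the GPU objects.
    """
    with ThreadPoolExecutor(max_workers=max(len(gpus), 1)) as executor:
        # Each call blocks for ~0.5 s; the executor overlaps them across GPUs.
        results = list(
            executor.map(lambda gpu: gpu.supportsGetTotalEnergyConsumption(), gpus)
        )
    for gpu, supported in zip(gpus, results):
        gpu._supportsGetTotalEnergyConsumption = supported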

    self._supportsGetTotalEnergyConsumption = True
else:
    self._supportsGetTotalEnergyConsumption = False
    logger.warning(

Maybe WARNING is a bit too high. Energy measurement will continue to work via power polling. I think we should make it an INFO and the message should be a bit more detailed. With the current message, people will think energy measurement is not going to work at all on this GPU.
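
For example, the fallback message could look roughly like this (wording purely illustrative, reusing the logger from the hunk above):

logger.info(
    "Total energy counter is unavailable or inaccurate on this GPU. "
    "Energy measurement will still work, but Zeus will fall back to "
    "power polling instead of the hardware energy counter."
)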

elif "power" in energy_dict and "counter_resolution" in energy_dict: # Old API
energy = energy_dict["power"] * energy_dict["counter_resolution"]
else:
raise ValueError("Unexpected energy dictionary format")

Did we define a better exception class for this kind of error? If so, let's use it. Otherwise, let's not bother.
