-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AMD integration #132
base: master
Are you sure you want to change the base?
AMD integration #132
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked at a high level. Thanks for your work!
I think 0.5 seconds is just right, assuming it works.
def supportsGetTotalEnergyConsumption( | ||
self, executor: concurrent.futures.ThreadPoolExecutor | ||
) -> concurrent.futures.Future: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think altering the signature breaks the abstract class contract for GPU
and GPUs
.
power = self.getInstantPowerUsage() | ||
initial_energy = self.getTotalEnergyConsumption() | ||
time.sleep(wait_time) | ||
final_energy = self.getTotalEnergyConsumption() | ||
|
||
measured_energy = final_energy - initial_energy | ||
expected_energy = ( | ||
power * wait_time | ||
) # power is in mW, wait_time is in seconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expected energy here contains an assumption that the GPU will maintain its activity level and power draw during the 0.5 second (or any non-trivial length) period. However, after this check and the 0.5 second wait monitoring period asynchronously begins, Deep Learning scripts that use Zeus will perform its own initialization, like loading the model to train on the GPU, which will change power draw. The whole point of implementing this check in an asynchronous manner is to allow such work to overlap with this check, so this is a logical contradiction and a race condition.
I advocate just making this check block for 0.5 seconds, and using a thread pool executor to parallelize this check across four GPUs is a great idea, so that regardless of the number of GPUs, the check will only take 0.5 seconds. I think 0.5 seconds is not big deal during initialization.
self._supportsGetTotalEnergyConsumption = True | ||
else: | ||
self._supportsGetTotalEnergyConsumption = False | ||
logger.warning( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe WARNING
is a bit too high. Energy measurement will continue to work via power polling. I think we should make it an INFO
and the message should be a bit more detailed. With the current message, people will think energy measurement is not going to work at all on this GPU.
elif "power" in energy_dict and "counter_resolution" in energy_dict: # Old API | ||
energy = energy_dict["power"] * energy_dict["counter_resolution"] | ||
else: | ||
raise ValueError("Unexpected energy dictionary format") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we define a better exception class for this kind of error? If so, let's use it. Otherwise, let's not bother.
A few questions:
zeus/zeus/device/gpu/amd.py:247:13 - error: Operator "*" not supported for types "c_uint32" and "Literal[1000]" (reportOperatorIssue) zeus/zeus/device/gpu/amd.py:258:9 - error: Method "supportsGetTotalEnergyConsumption" overrides class "GPU" in an incompatible manner Positional parameter count mismatch; base method has 1, but override has 2 (reportIncompatibleMethodOverride) zeus/zeus/device/gpu/amd.py:280:26 - error: Cannot assign to attribute "_supportsGetTotalEnergyConsumption" for class "AMDGPU*" Attribute "_supportsGetTotalEnergyConsumption" is unknown (reportAttributeAccessIssue) zeus/zeus/device/gpu/amd.py:282:26 - error: Cannot assign to attribute "_supportsGetTotalEnergyConsumption" for class "AMDGPU*" Attribute "_supportsGetTotalEnergyConsumption" is unknown (reportAttributeAccessIssue) zeus/zeus/device/gpu/amd.py:293:26 - error: Cannot assign to attribute "_supportsGetTotalEnergyConsumption" for class "AMDGPU*" Attribute "_supportsGetTotalEnergyConsumption" is unknown (reportAttributeAccessIssue) zeus/zeus/device/gpu/amd.py:344:19 - error: Cannot access attribute "value" for class "AmdSmiException" Attribute "value" is unknown (reportAttributeAccessIssue) zeus/zeus/device/gpu/amd.py:346:37 - error: Cannot access attribute "msg" for class "AmdSmiException"
I can fix these, but do you think this is the right approach? I wanted to make sure before tweaking base class signatures.