Instinct MI100 cluster fails to reset on restart #80

Open
TNT3530 opened this issue May 3, 2024 · 2 comments

TNT3530 commented May 3, 2024

Proxmox 7.3-3, kernel 5.15.53-1-pve

I applied the changes here to get it functioning with this kernel, and double-checked that all PCIe device reset_method values are correctly set to device_specific.
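
The reset_method check mentioned above could be scripted roughly as follows. This is only a sketch: it assumes a kernel that exposes the per-device `reset_method` attribute in sysfs (5.15 and later) and that the GPUs use AMD's PCI vendor ID `0x1002`. The function name `list_reset_methods` and the `SYSFS_PCI` override variable are hypothetical, introduced here for illustration.

```shell
#!/bin/sh
# Print "<PCI address> <reset_method contents>" for every PCI device
# from the given vendor.
# SYSFS_PCI is a hypothetical override so the function can be pointed at a
# test directory; on a real host it defaults to /sys/bus/pci/devices.
list_reset_methods() {
    vendor="${1:-0x1002}"   # 0x1002 is AMD's PCI vendor ID
    root="${SYSFS_PCI:-/sys/bus/pci/devices}"
    for dev in "$root"/*; do
        # Skip entries without a readable vendor file or with another vendor.
        [ -r "$dev/vendor" ] || continue
        [ "$(cat "$dev/vendor")" = "$vendor" ] || continue
        if [ -r "$dev/reset_method" ]; then
            method="$(cat "$dev/reset_method")"
            [ -n "$method" ] || method="empty"
        else
            method="unavailable"
        fi
        printf '%s %s\n' "$(basename "$dev")" "$method"
    done
}
```

Running `list_reset_methods 0x1002` on the host should print one line per AMD device; any line that does not show `device_specific` would point at a device whose reset method was not set as intended.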

The first guest boot shows:
image
but all GPUs pass through fine.

Attempting to shut down and restart the guest causes this:
image
ending with the guest failing to boot with "atombios stuck in loop for more than 20secs aborting":
image

gnif (Owner) commented May 8, 2024

Your method of setting the reset method to device_specific is not supported; you are supposed to use the udev rules provided in the project. Your service may be running too late, and the built-in reset may already have been used at some point during boot.

If this does not solve the problem, I am sorry but there is not much else we can do here.

TNT3530 (Author) commented May 8, 2024

I have the DKMS module loaded on the Proxmox host:
image

and activated in my /etc/modules:
image

With the service disabled, here is the initial boot:
image

All GPUs pass through fine.

After restarting in the guest, this is the output:
image

Searching dmesg | grep reset returns nothing other than the above and a few USB devices, and dmesg | grep vfio shows no new lines, so I assume it isn't running.
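
The two dmesg searches above can be combined into one filter that also drops the USB noise. This is a rough sketch rather than anything provided by the project; the function name `filter_reset_lines` is hypothetical, and matching on the substring "usb" is an assumption about how the unrelated lines look.

```shell
#!/bin/sh
# Read kernel-log text on stdin and keep lines mentioning "reset" or "vfio",
# excluding lines that mention USB. Prints nothing if no line matches.
filter_reset_lines() {
    grep -iE 'reset|vfio' | grep -vi 'usb' || true
}
```

Typical use on the host would be `dmesg | filter_reset_lines` (reading the kernel log may require root). Comparing the output before and after restarting the guest shows whether any new reset-related line was logged.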

Moving vendor-reset to the first line of /etc/modules does the same thing as above, but with the bonus of:
image
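
The module state described in this comment (loaded on the host and listed in /etc/modules) could be confirmed in text form with something like the sketch below. The function name `check_module` and its optional file arguments are hypothetical; the sketch assumes the module is named `vendor-reset` as in the comment and relies on the kernel listing loaded modules with underscores in place of dashes.

```shell
#!/bin/sh
# Report whether a module is listed in a modules file (comment lines are
# ignored because they do not start with the module name) and whether it
# appears in the list of currently loaded modules.
check_module() {
    name="$1"
    modules_file="${2:-/etc/modules}"
    loaded_file="${3:-/proc/modules}"
    # Loaded modules are reported with '_' instead of '-'.
    loaded_name="$(printf '%s' "$name" | tr '-' '_')"
    if grep -qE "^[[:space:]]*${name}([[:space:]]|\$)" "$modules_file" 2>/dev/null; then
        echo "listed: yes"
    else
        echo "listed: no"
    fi
    if grep -qE "^${loaded_name} " "$loaded_file" 2>/dev/null; then
        echo "loaded: yes"
    else
        echo "loaded: no"
    fi
}
```

Calling `check_module vendor-reset` on the host prints two lines. A result of "listed: yes" with "loaded: no" would suggest the module is failing to load at boot rather than loading too late.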
