Binding and unbinding from amdgpu -> unstable Windows VM until reboot #52

Open
drujd opened this issue Jan 2, 2022 · 6 comments

Comments

@drujd

drujd commented Jan 2, 2022

I have a 640SP version of RX550 (Polaris11-based) and it seems that something is missing from its reset routine to work correctly in a guest Windows 11 VM after using it with Linux/amdgpu driver before that (regardless of whether that happens on a host or in a guest Linux VM).

If the GPU is never bound to amdgpu (vfio-pci.ids=1002:67ff,1002:aae0 kernel param), it works perfectly. I can reboot, reset, shutdown & start the VM again and all is fine (but I think that was the case even without this module).
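To double-check that the IDs really reached the kernel, something like the following could be run against the contents of /proc/cmdline. This is only a minimal sketch: the function name `vfio_ids_from_cmdline` is made up for illustration, and it returns the entries as written without validating them.

```python
def vfio_ids_from_cmdline(cmdline: str) -> list[str]:
    """Return the entries listed in a vfio-pci.ids= kernel parameter."""
    ids = []
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        # The kernel treats '-' and '_' in parameter names as equivalent,
        # so both vfio-pci.ids= and vfio_pci.ids= are accepted here.
        if sep and name.replace("-", "_") == "vfio_pci.ids":
            ids.extend(part.lower() for part in value.split(",") if part)
    return ids


# Example usage on a running system (path is the standard procfs location):
#   with open("/proc/cmdline") as f:
#       print(vfio_ids_from_cmdline(f.read()))
```

An empty list would mean the parameter is missing from the running kernel's command line, in which case amdgpu may still claim the card at boot.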

However, once I actually use the GPU in Linux (whether in a host or guest system doesn't matter), it is 'doomed' for Windows usage until a (host) reboot. The VM actually seems to work at first and boots to Windows, but after a while on the desktop (or immediately if I e.g. try to start Edge), the driver (21.12.1) crashes, the screen blinks many times and, after a while, Windows falls back to the basic driver. Reboot / hard reset / shutdown of the VM doesn't help, only a reboot of the whole system does.

I am running Arch 5.15.12-arch1. I am aware of #46 and have 'w /sys/bus/pci/devices/0000:05:00.0/reset_method - - - - device_specific' in tmpfiles.d and the module seems to work 'correctly':

systemd[1]: Started Virtual Machine qemu-1-win11.
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: version 1.1
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: performing pre-reset
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: performing reset
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: CLOCK_CNTL: 0x0, PC: 0x20594
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: Performing BACO reset
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: performing post-reset
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: reset result = 0
kernel: vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
kernel: vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
kernel: vfio-pci 0000:05:00.0: vfio_ecap_init: hiding ecap 0x1e@0x370
kernel: vfio-pci 0000:05:00.1: enabling device (0000 -> 0002)
kernel: vfio-pci 0000:0f:00.3: enabling device (0000 -> 0002)
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: version 1.1
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: performing pre-reset
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: performing reset
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: CLOCK_CNTL: 0x0, PC: 0x2880
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: performing post-reset
kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: reset result = 0

Maybe the reset routine for Polaris is just incomplete?
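For comparing runs, the vendor-reset lines in the log above can be condensed into one record per reset pass. The sketch below is an assumption-laden helper, not part of vendor-reset: `summarize_resets` and its record layout are invented here, and the regex only covers the message format seen in this log. In the log above, the first pass reports a BACO reset and the second does not; whether that difference matters is not established here.

```python
import re

# Matches e.g. "kernel: vfio-pci 0000:05:00.0: AMD_POLARIS11: performing reset"
LINE_RE = re.compile(r"vfio-pci (?P<dev>[0-9a-f:.]+): AMD_\w+: (?P<msg>.*)")


def summarize_resets(log_text):
    """Return one dict per reset pass: device, whether BACO was logged, result code."""
    resets = []
    current = None
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # not a vendor-reset message (e.g. vfio_ecap_init lines)
        msg = m.group("msg").strip()
        if msg == "performing pre-reset":
            # A pre-reset line opens a new pass.
            current = {"device": m.group("dev"), "baco": False, "result": None}
            resets.append(current)
        elif current is None:
            continue  # e.g. the "version 1.1" banner before a pass starts
        elif msg.lower() == "performing baco reset":
            current["baco"] = True
        elif msg.startswith("reset result ="):
            current["result"] = int(msg.split("=", 1)[1])
            current = None  # the result line closes the pass
    return resets
```

Feeding it the journal output for a VM start would make it easy to diff a 'good' boot (GPU never bound to amdgpu) against a 'bad' one.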

@drujd
Author

drujd commented Jan 2, 2022

OK, the issue stops manifesting when I DISABLE 'Above 4G decoding' in BIOS. Weird, some people with AMD cards reported that passthrough works for them only with it enabled... (And yes, I know resizeable BAR is not supported, that has always been off)
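One way to see what the BIOS option changes for the card is to compare where its BARs are mapped with the setting on and off. The sketch below parses the text of a device's sysfs `resource` file (e.g. /sys/bus/pci/devices/0000:05:00.0/resource), where each line holds start, end and flags in hex. The helper name `bars_above_4g` is made up, and this only reports addresses; it does not explain why the Windows driver would care.

```python
FOUR_GIB = 1 << 32


def bars_above_4g(resource_text):
    """Return (index, start, size) for each populated region starting at or above 4 GiB."""
    found = []
    for index, line in enumerate(resource_text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue  # skip anything that is not "start end flags"
        start, end, _flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0:
            continue  # unused region
        if start >= FOUR_GIB:
            found.append((index, start, end - start + 1))
    return found


# Example usage on the host:
#   with open("/sys/bus/pci/devices/0000:05:00.0/resource") as f:
#       print(bars_above_4g(f.read()))
```

If the list is non-empty only with 'Above 4G decoding' enabled, that would at least confirm the option moves the card's BARs into 64-bit address space on this board.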

@cppmonkey

Good to know!

Have an Asrock X570D4U (Ryzen 5700G) running Proxmox 7.1 (Kernel 5.13) and passing through 2x Radeon RX460 (Same chipset as your RX550).
Don't seem to have a reset issue. But passing a card through to a guest using DP, it would reset the host upon the DE loading.
After moving to HDMI, the issue went away. But I had to disable "Power Saving - Black Screen" or the guest would freeze.
Not sure if Above 4G decoding is enabled - I'll have to check

My desktop (Ryzen 9 3950X, Radeon RX 5600XT) seems to have a similar issue. With DP, the system randomly fails to wake the screen, and I have to log in remotely to reboot it. Using HDMI works fine, except that the screen doesn't go to sleep.

Curious whether your system is Intel- or AMD-powered?

@drujd
Author

drujd commented Jan 3, 2022

Asus X570-E
Ryzen 5950X
Vega 64 & RX550 (640SP) 4GiB

@cppmonkey

Turns out Above 4G decoding was enabled on the X570D4U.
Started running a guest and passed both RX460s through - worked fine for 30 mins and then GPU0 crashed, locking up the system.
Halt and restart - 10 Mins stable
Halt and restart - 5 Mins stable

Given they take power from the PCIe interface, wonder if there is a power/heat issue. But GPU0 didn't feel especially hot

Been stable with a single card; the only issue is that the screens won't go to sleep. They go off and instantly wake up.
It's interesting that you need to use this vendor-reset project, whilst I haven't needed to. However, I am running a Linux guest and not a Windows one.

Will give a live FC35 drive a go, to see if the instability remains with the 2x RX460s (and the AT2500 and (Cezanne) Vega 8 GPUs).

@drujd
Author

drujd commented Jan 4, 2022

I don't think you have to use vendor-reset for Polaris cards as long as they shut down gracefully, but this project should allow them to recover from bad states caused by VM crashes, bad implementations of the shutdown procedure (in macOS, IIRC), etc.

Honestly, neither of your issues seems connected to the reset bug.

@bitshiftnetau

Not sure if I'm facing the same issue exactly, but certainly the same symptoms as I'm sure you are facing. Windows 10 VM, Navi 23 RX6600 (currently not fully supported by this module afaik). Random shutdowns and then Proxmox requires a full system reboot.
