Looping Invalidation Time-out Error and possible fix #66

Open
BoyStot opened this issue Nov 4, 2022 · 0 comments

Thanks for the work on this patch; it's really helpful.

I installed this on Proxmox 7.2, together with a hookscript that sets the device_specific reset method. The reset fires whenever a VM uses my device, a Radeon Pro WX5100, and it runs as it should:

[ 892.924004] vfio-pci 0000:07:00.0: AMD_POLARIS10: version 1.1
[ 892.924013] vfio-pci 0000:07:00.0: AMD_POLARIS10: performing pre-reset
[ 892.943929] vfio-pci 0000:07:00.0: AMD_POLARIS10: performing reset
[ 892.943942] vfio-pci 0000:07:00.0: AMD_POLARIS10: CLOCK_CNTL: 0x0, PC: 0x2a44
[ 892.943945] vfio-pci 0000:07:00.0: AMD_POLARIS10: performing post-reset
[ 892.983937] vfio-pci 0000:07:00.0: AMD_POLARIS10: reset result = 0
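
For readers who want to reproduce this setup, here is a minimal sketch of such a hookscript. It is not the script used above. It assumes a kernel that exposes the `reset_method` sysfs attribute, that Proxmox invokes the hookscript with the VM ID and the phase name as arguments, and that the GPU sits at the PCI address shown in the log. The names `set_reset_method` and `handle_phase` are made up for illustration.

```python
#!/usr/bin/env python3
"""Sketch of a Proxmox hookscript that selects the device_specific reset method."""
import sys
from pathlib import Path

# Assumption: PCI address taken from the log above; adjust for your device.
PCI_ADDRESS = "0000:07:00.0"
SYSFS_PCI_DEVICES = Path("/sys/bus/pci/devices")


def set_reset_method(pci_address, method="device_specific",
                     sysfs_root=SYSFS_PCI_DEVICES):
    """Write the reset method name into the device's reset_method attribute."""
    attr = Path(sysfs_root) / pci_address / "reset_method"
    attr.write_text(method + "\n")
    return attr


def handle_phase(phase, pci_address=PCI_ADDRESS, sysfs_root=SYSFS_PCI_DEVICES):
    """Act only on pre-start; return True if the reset method was written."""
    if phase == "pre-start":
        set_reset_method(pci_address, sysfs_root=sysfs_root)
        return True
    return False


if __name__ == "__main__":
    # Assumption: Proxmox calls the hookscript as <script> <vmid> <phase>.
    if len(sys.argv) >= 3:
        handle_phase(sys.argv[2])
```

The `sysfs_root` parameter exists only so the logic can be exercised against a scratch directory before it is pointed at the real `/sys`.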

However, with vendor-reset as the only reset method, stopping my VM and restarting it still triggered a bug: a continuous stream of DMAR errors that locks up the node and forces a cold boot.

[ 893.004144] DMAR: VT-d detected Invalidation Completion Error: SID 0
[ 893.004146] DMAR: QI HEAD: UNKNOWN qw0 = 0x0, qw1 = 0x0
[ 893.004148] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x185f6729c4
[ 893.004150] DMAR: Invalidation Completion Error (ICE) cleared
[ 893.004266] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[ 893.004268] DMAR: QI HEAD: UNKNOWN qw0 = 0x0, qw1 = 0x0
[ 893.004269] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x185f6729cc
[ 893.004272] DMAR: Invalidation Time-out Error (ITE) cleared

After a lot of searching and testing, I found this PCI reset script. When added to the post-stop hook in Proxmox, it allows the VM to stop and start without crashing the host.

I still get 3-4 of the DMAR errors on VM start, but they stop right away, the VM launches, and I can use the GPU without issue.

The only remaining problem is that Proxmox hookscripts don't run when the guest reboots rather than shuts down. vendor-reset does still reset the device in this case, and I see the AMD_POLARIS10 messages.

Is there something this script does that could be included in vendor-reset so that guest reboots work as well?
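
The reset script itself is not reproduced in this issue, so the following is only a sketch of one common approach to resetting a PCI device from the host: removing the device through sysfs and then rescanning the bus. Whether the script mentioned above does the same thing is an assumption. The function name `remove_and_rescan`, the settle delay, and the PCI address are illustrative.

```python
#!/usr/bin/env python3
"""Sketch of a post-stop action: remove a PCI device and rescan the bus."""
import sys
import time
from pathlib import Path

# Assumption: PCI address taken from the log above; adjust for your device.
PCI_ADDRESS = "0000:07:00.0"
SYSFS_PCI = Path("/sys/bus/pci")


def remove_and_rescan(pci_address, sysfs_pci=SYSFS_PCI, settle_seconds=1.0):
    """Detach the device via its 'remove' attribute, then trigger a bus rescan.

    Returns True if the device directory existed and 'remove' was written.
    """
    sysfs_pci = Path(sysfs_pci)
    device_dir = sysfs_pci / "devices" / pci_address
    removed = False
    if device_dir.exists():
        (device_dir / "remove").write_text("1\n")
        removed = True
    # Hypothetical settle delay before rescanning; tune or drop as needed.
    time.sleep(settle_seconds)
    (sysfs_pci / "rescan").write_text("1\n")
    return removed


if __name__ == "__main__":
    # Assumption: Proxmox calls the hookscript as <script> <vmid> <phase>.
    if len(sys.argv) >= 3 and sys.argv[2] == "post-stop":
        remove_and_rescan(PCI_ADDRESS)
```

Because this runs from the post-stop phase, it shares the limitation described in this issue: it would not run on a guest-initiated reboot.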

Thanks!
