Code 43 in guest when passing through NVIDIA GPU #34

Open
AnErrupTion opened this issue Aug 14, 2024 · 13 comments

Comments

@AnErrupTion

Bug Description

When following the guide over here, adapted to pass through a dedicated GPU, a code 43 error is observed after installing the GPU drivers in the guest system using Device Manager.

How to Reproduce

  1. Follow the previously linked guide (host-side steps are sketched below), ensuring that:
  • vfio-pci is correctly bound to the GPU
  • The proper memlock modifications are done in /etc/security/limits.conf
  • The proper permissions are set throughout /dev/vfio/*
  • The VFIO device is attached to the guest using --attachvfio
  2. Install the GPU drivers, here using the latest NVIDIA 560.81 drivers
  3. Reboot and observe the code 43 error (NOTE: a fairly long ~5-6 second freeze can also be observed when booting the VM; I'm assuming it tries to load the NVIDIA driver but fails to do so)
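For reference, a minimal sketch of the host-side setup from step 1. The PCI address 0000:01:00.0 and device ID 10de:25a2 are taken from the logs in this thread; substitute your own values from lspci -nn, and note that the username is a placeholder:

```sh
# Bind vfio-pci to the GPU (address/ID from this issue; adjust to your system).
modprobe vfio-pci
echo "0000:01:00.0" > /sys/bus/pci/devices/0000:01:00.0/driver/unbind  # only if another driver is bound
echo "vfio-pci" > /sys/bus/pci/devices/0000:01:00.0/driver_override
echo "0000:01:00.0" > /sys/bus/pci/drivers_probe

# /etc/security/limits.conf -- allow the VM user to lock enough memory for the guest:
#   youruser  soft  memlock  unlimited
#   youruser  hard  memlock  unlimited

# Make the VFIO device nodes accessible to the VM user:
chown youruser:youruser /dev/vfio/*
```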

VM configuration

Guest OS configuration details:

  • Guest OS type and version (e.g. Windows 10 22H2): Windows 11 23H2
  • Attach guest VM configuration file from VirtualBox VMs/<guest VM name>/<guest VM name>.vbox: Windows 11.vbox.zip

Host OS details:

  • Host OS distribution: Arch Linux
  • Host OS kernel version: Linux shininglea 6.10.4-arch2-1 #1 SMP PREEMPT_DYNAMIC Sun, 11 Aug 2024 16:19:06 +0000 x86_64 GNU/Linux

Logs

@snue (Contributor) commented Aug 14, 2024

I see the split lock detection triggers in your dmesg log. That will cause issues for the VM, up to the point where it may not make any progress. I am not sure whether that is the root cause of your issue, but please try the recommendation from the README and see if it helps:

> Starting with Intel Tiger Lake (11th Gen Core processors) or newer, split lock detection must be turned off in the host system. This can be achieved using the Linux kernel command line parameter split_lock_detect=off or using the split_lock_mitigate sysctl.
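A sketch of both options, for reference (the GRUB file path is an assumption based on a typical Arch setup; the sysctl requires a kernel recent enough to have kernel.split_lock_mitigate):

```sh
# Option 1: kernel command line, e.g. in /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="... split_lock_detect=off"
# then regenerate the config and reboot:
grub-mkconfig -o /boot/grub/grub.cfg

# Option 2: at runtime via sysctl (no reboot needed):
sysctl kernel.split_lock_mitigate=0
```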

@AnErrupTion (Author)

> I see the split lock detection triggers in your dmesg log. That will cause issues for the VM, up to the point where it may not make any progress. I am not sure whether that is the root cause of your issue, but please try the recommendation from the README and see if it helps:
>
> Starting with Intel Tiger Lake (11th Gen Core processors) or newer, split lock detection must be turned off in the host system. This can be achieved using the Linux kernel command line parameter split_lock_detect=off or using the split_lock_mitigate sysctl.

I was pretty sure I had already disabled it. But, either way, adding the command line parameter didn't do anything, although I now see this in dmesg:

 Unknown kernel command line parameters "split_lock_detect=off", will be passed to user space.

But I also see x86/split lock detection: disabled earlier in the log, so I'm assuming it's actually disabled now.
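To double-check which mode the kernel ended up in, filtering dmesg works:

```sh
# Expect "x86/split lock detection: disabled" once the parameter takes effect.
sudo dmesg | grep -i "split lock"
```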

@tpressure (Contributor)

@snue is correct.

Here we have it:

[ 2109.050169] x86/split lock detection: #AC: EMT-0/4675 took a split_lock trap at address: 0xfffff8021f251f4f

> Unknown kernel command line parameters "split_lock_detect=off", will be passed to user space.

Yes, this is expected.

> But I also see x86/split lock detection: disabled earlier in the log, so I'm assuming it's actually disabled now.

Sounds about right. Did it solve your issue?

@AnErrupTion (Author)

> Sounds about right. Did it solve your issue?

Unfortunately, it didn't solve the issue.

@tpressure (Contributor)

@AnErrupTion can you post new logs with split lock disabled?

@AnErrupTion (Author)

Ah yes, my bad. Here they are:

dmesg.log
Windows 11-2024-08-14-17-15-07.log

@tpressure (Contributor) commented Aug 14, 2024

It looks a little bit better, and the guest is definitely trying to use the GPU:

00:00:07.099476 VFIO: RegisterBar 0xf0000000 
00:00:07.099500 VFIO: RegisterBar 0x800000000 
00:00:07.099501 VFIO: RegisterBar 0x900000000 
00:00:07.099503 VFIO: RegisterBar 0x6000 
00:00:07.099809 VFIO: Activate MSI count: 1

and

[   43.766761] vfio-pci 0000:01:00.0: enabling device (0000 -> 0003)

I assume this card needs some kind of quirk. I can maybe look into this in a couple of weeks.

Can you upload the output of lspci -vvvn please?
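For example, captured as root and narrowed to the GPU (the -s selector and the output filename are illustrative):

```sh
# Verbose dump with numeric IDs; run as root so capability registers are decoded.
sudo lspci -vvvn -s 01:00.0 > lspci.log
```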

@AnErrupTion (Author)

> I assume this card needs some kind of quirk. I can maybe look into this in a couple of weeks.

I'm not sure it does, since passing through the same GPU with QEMU works just fine (no additional quirks or shenanigans needed).

> Can you upload the output of lspci -vvvn please?

Alright, here's the output (when run as root): lspci.log

@tpressure (Contributor)

> I'm not sure it does, since passing through the same GPU with QEMU works just fine (no additional quirks or shenanigans needed).

QEMU automatically applies the necessary quirks when it detects a card that needs them.

@AnErrupTion (Author)

> > I'm not sure it does, since passing through the same GPU with QEMU works just fine (no additional quirks or shenanigans needed).
>
> QEMU automatically applies the necessary quirks when it detects a card that needs them.

Is there a way of knowing which ones it applies? I can fire up a QEMU VM if needed.
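One way to test that comparison (assuming QEMU's experimental x-no-geforce-quirks vfio-pci property, which disables the GeForce-specific quirks) would be to boot the QEMU guest with the quirks turned off and see whether code 43 appears there too; the machine and memory flags here are illustrative:

```sh
# Boot the same GPU under QEMU with the GeForce quirks disabled, for comparison.
qemu-system-x86_64 \
  -machine q35 -accel kvm -cpu host -m 8G \
  -device vfio-pci,host=0000:01:00.0,x-no-geforce-quirks=on \
  -drive file=win11.img,format=raw  # plus the rest of your usual guest config
```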

@AnErrupTion (Author)

Also, I guess I forgot to mention one interesting bit: when I went to check for updates in the VM, Windows Update did not download the NVIDIA driver, and I had to download it manually (it then installed fine). And when I went to Device Manager, it said the driver in use is not the same one as the POSTed graphics driver, or something along those lines. None of this happened with QEMU either.

@snue (Contributor) commented Aug 14, 2024

There are quite a few NVIDIA quirks in QEMU. The quirky MSI handling is an obvious suspect, but so is the mirrored config space access in general. See this background discussion: https://patchwork.kernel.org/project/qemu-devel/patch/20180129202326.9417.71344.stgit@gimli.home/

Maybe you can force the GPU into legacy interrupt mode instead of MSI in the Windows VM to try and work around that?

@AnErrupTion (Author) commented Aug 14, 2024

> There are quite a few NVIDIA quirks in QEMU. The quirky MSI handling is an obvious suspect, but so is the mirrored config space access in general. See this background discussion: https://patchwork.kernel.org/project/qemu-devel/patch/20180129202326.9417.71344.stgit@gimli.home/
>
> Maybe you can force the GPU into legacy interrupt mode instead of MSI in the Windows VM to try and work around that?

I have tried to disable MSI by setting MSISupported in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\PCI\VEN_10DE&DEV_25A2&SUBSYS_13FC1043&REV_A1\3&267a616a&0&80\Device Parameters\Interrupt Management\MessageSignaledInterruptProperties to 0 instead of 1, but unfortunately the problem persists. One interesting thing, though: in the utility I was using (MSI mode utility v3.1), my GPU doesn't actually appear in the list of devices, even though it's present in the registry and supports MSI (that last part shouldn't matter, since devices that don't support MSI also appear in the program's list):

[screenshot: MSI mode utility v3.1 device list]
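For reference, the registry change described above can also be expressed as a .reg file, using the exact device path from this comment:

```reg
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\PCI\VEN_10DE&DEV_25A2&SUBSYS_13FC1043&REV_A1\3&267a616a&0&80\Device Parameters\Interrupt Management\MessageSignaledInterruptProperties]
"MSISupported"=dword:00000000
```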
