VMware vSphere / ESXi 7.0 GA work-around for GPU passthrough issues including disabled-after-reboot bug and UI bug
A new and improved workaround is available as of Jun 18 2020, details below.
When I was upgrading my primary Supermicro SuperServer Workstation / Datacenter, I ran into some strange problems with getting passthrough working. What would happen is that I'd get everything squared away, and boot my Windows 10 VM with my AMD Radeon 7750 GPU successfully passed through, as I've been doing for many years, see:
- What fits in any home virtualization lab, has 8 Xeon cores, 6 drives, 128 GB memory, and 3 4K outputs from a Windows 10 VM? Your new Supermicro SuperServer Workstation!
Jul 15 2015
I went all-in with my ESXi 7.0 upgrade: in my most crucial (but backup-protected) Windows 10 VM, I also upgraded the virtual hardware to version 17 and updated VMware Tools. After a reboot of my ESXi host, I noticed my Windows 10 VM wouldn't boot up. The reason soon became clear: my passthrough settings weren't persisting through reboots. This was nerve-wracking, as I had work the next morning and had to figure out a way to get things square again without falling back to 6.7U3, and/or reverting to backups of 1.8 TB of data.
Thankfully, I found a workaround for my new ESXi 7.0. Warning: it's pretty wonky, but quick and easy. It's not permanent, though; you have to repeat it after every ESXi host reboot. If you've found a better way around this, by all means drop a comment below to let us all know!
Note that I currently have no valid way of reporting such bugs to VMware, still working on that. When a new dot-zero release like vSphere 7.0 comes out (on April 2nd) and your hardware isn't on the VMware Compatibility Guide, at least not yet, opening a per-incident ticket isn't an option. I tried! I'll explain all that in another article soon.
Meanwhile, after a few dozen attempts and reboots, I found a workaround that I published a video of back on April 9, 2020, and now this article will hopefully help others as well. Strangely enough, over 500 folks have seen that video already, so unfortunately, I suspect I'm not alone with my issue. I hope the next patch release fixes this issue, which I've also posted to the VMTN forum.
New as of June 18 2020, and tested successfully in my home lab!
Note, this is not a fix, it's merely a stop-gap workaround until a hopefully much more elegant fix comes along. At least it persists; it's not something you have to redo after every reboot, so that's good.
Follow the method shown in William Lam's new article Passthrough of Integrated GPU [iGPU] for standard Intel NUC, where he explains that the issue is about ESXi claiming the VGA driver. But beware: you will no longer see ESXi boot before your auto-started VM with GPU passthrough comes up! Here's the one-line SSH command to issue; then reboot, and that's it!
```
esxcli system settings kernel set -s vga -v FALSE
```
Presumably, when a better resolution comes along in a subsequent ESXi 7.x release, we can issue this command to undo the change:
```
esxcli system settings kernel set -s vga -v TRUE
```
returning to the original ESXi 6.x behavior where your display shows your familiar black and yellow ESXi DCUI boot sequence, followed by a VGA hand-over to your GPU accelerated VM later on when that VM is auto-started.
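To confirm which way the toggle is currently set, the same esxcli namespace can list the value. Here's a minimal sketch, assuming SSH access to the ESXi host; the guard clause is my addition, so the snippet is a harmless no-op if run anywhere esxcli doesn't exist:

```shell
# Show the current 'vga' kernel setting (Configured vs Runtime values).
# esxcli exists only on an ESXi host, so guard before calling it.
if command -v esxcli >/dev/null 2>&1; then
  vga_setting=$(esxcli system settings kernel list -o vga)
else
  vga_setting="esxcli unavailable (run this over SSH on the ESXi host)"
fi
echo "$vga_setting"
```

If Configured shows FALSE while Runtime still shows TRUE, the host hasn't been rebooted since the change, so the workaround isn't in effect yet.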
Alternatively, you can work around this GPU mapping issue without changing anything via ESXi, but it's not sticky, so you'll need to do this UI operation after every ESXi reboot:
- In vSphere Client or ESXi Host Client, set both of your AMD GPU devices (video & audio) to passthrough
- Reboot the server
- After the reboot, if you use ESXi Host Client and notice Passthrough status shows "Enabled/Needs reboot" instead of active, toggle both AMD devices off and then on again; you'll then see them both active, with no reboot required
- Now you can start your VM that uses the PCI device
- If you find your mappings are wrong and your VM still won't start, remove the PCI devices from the VM, then re-add them. This is covered in more detail in the video below.
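The toggling above can also be done over SSH instead of the UI, since ESXi 7.0 introduced an esxcli passthrough namespace. A hedged sketch below: the 0000:03:00.0 and 0000:03:00.1 PCI addresses are placeholders for your GPU's video and audio functions (find your own with the list command first), and the guard clause is my addition so the script stays inert off-host:

```shell
# List passthrough-capable PCI devices, then enable both GPU functions.
# The PCI addresses below are examples; substitute your own from the list.
if command -v esxcli >/dev/null 2>&1; then
  esxcli hardware pci pcipassthru list
  esxcli hardware pci pcipassthru set -d 0000:03:00.0 -e true -a
  esxcli hardware pci pcipassthru set -d 0000:03:00.1 -e true -a
  result="passthrough toggled"
else
  result="esxcli unavailable (run this over SSH on the ESXi host)"
fi
echo "$result"
```

The `-a` flag asks ESXi to apply the change immediately where possible, which mirrors the no-reboot-required behavior of the off/on toggle in the Host Client.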
I opened a VMware Service Request for this issue, as I explained here. It turns out VMware knows about this issue, has it documented for their support group, and plans to release a fix soon. That's all the information I'm able to share, but it's also all the information I've been given.
With vSphere 7.0 Update 1 released 3 days ago, I have some good news to report. It appears VMware has very much fixed this bug. That's right, issuing this command:
```
esxcli system settings kernel set -s vga -v TRUE
```
returns me to the factory behavior, and I don't have to reconfigure GPU passthrough after rebooting. This is good!
Now I'm just working through why I seem to sometimes get reverted to 7.0b, discussed here where I show this error:
```
Shutting down firmware services...
Using 'simple offset' UEFI RTS mapping policy
Relocating modules and starting up the kernel...
```

See also this wonderful article [Roll back and downgrade VMware ESXi version](https://4sysops.com/archives/roll-back-and-downgrade-vmware-esxi-version/), which reveals the alternate ESXi version clearly present on my 32GB USB drive, seen in the screenshot at right.

---

## Screenshots

---

## See also at TinkerTry

- All vSphere 7 [articles](https://tinkertry.com/category:vSphere7).
- All vSphere 7 [videos](https://www.youtube.com/playlist?list=PLCuu-J0IWcS7jSDOY49wyoMoLN-mLB4HX).

---

## See also

- **[Solved: Upgrade from 6.7 to 7.0 and unsupported hardware](https://communities.vmware.com/message/2941107#2941107)**
  Apr 10 2020 by zwbee at VMware Technology Network Forums
  > I went to Host/Manage/Hardware and the HBA was showing up with the correct name. However, its passthrough setting had been switched to inactive. I toggled it to active, but there was an error in addition to the usual "Reboot required" message. I figured it didn't work, but I tried rebooting anyway. After reboot, the device now showed as passthrough active. Promising!