Bug in some Intel Broadwell CPUs (Xeon D-1500 and Xeon E5/E7 v4) can result in VMware ESXi 5.5/6.0 PSODs, VMware KB 2146388 explains BIOS fix needed (FIX released Sep 14 2016)
That's the day I received my first email about these intermittent PSODs (Purple Screens Of Death) on a Supermicro SuperServer with two Xeon E5-2650 v4 processors inside. The referral came to me from Iwan Hoogendoorn (@i1wan), courtesy of Andreas Peetz (@VFrontDe) of VMware Front Experience. In case you didn't already know, Andreas is quite the VIB guru, and has undoubtedly helped vast numbers of homelabbers who need consumer-PC AHCI/SATA and Realtek drivers. You know, those drivers that went missing in VMware's 6.0 release, causing people's drives and NICs to disappear, until they found his simple fixes.
And Iwan, wow, incredible list of advanced certifications, works for VMware, and a super nice and incredibly helpful guy. See also links to his amazing giving-back work below. As you can see, I even had the honor of meeting Iwan recently, see also my recent My experience at VMworld 2016 US article.
Unfortunately, Iwan and I were stumped, and no PSOD fix was found. Weeks went by, with a very determined Iwan continuing to reach out to his contacts...
Rob Maas, owner of a Supermicro SuperServer SYS-5028D-TN4T system, reported a similar PSOD. That certainly got even more of my attention, since it's the same system I own and use daily. My gut was telling me that this problem was more likely an Intel issue, and not so much a Supermicro issue.
Those gut impressions were just that, conjecture. We needed much more information before we could conclude much of anything about the problem, never mind figuring out a solution.
Right through much of August, there was no acknowledgment to be found from Intel or VMware about this issue, anywhere. About all that the Xeon D and Xeon E stories had in common was that both systems were running VMware NSX. Since this appears to be a relatively rare and intermittent issue, it could take many days or even weeks to try to replicate. No way to force it to reliably fail-fast had been found yet either. He had even reached out to VMware support.
The first reference linking to VMware's mention of Intel's PSOD-causing problem seems to be this article:
- Intel Xeon CPU E5-26xx v4 PSOD
Aug 26 2016 by Allan Kjaer at Virtual Allan
where Allan refers us right to VMware KB 2146388 published on August 19 2016,
- ESXi host fails with purple diagnostic screen when using Intel Xeon Processor E5 v4, E7 v4, and D-1500 Families (2146388)
When running a supported version of ESXi (ESXi 5.5 U3b and ESXi 6.0 U1b or later) with Intel Xeon Processor E5 v4, E7 v4, and D-1500 Families you may experience these symptoms:
- The ESXi host fails with this purple diagnostic screen (PSOD)
- The backtrace information is missing in the screen
- In the PSOD screen, you see error similar to:
2016-07-27T13:14:04.549Z cpu58:42053)@BlueScreen: #PF Exception 14 in world 42053:vmm7:My_VM IP 0x410016bb8000 addr 0x410016bb8000 PTEs:0x10001c023;0x8000010023;0x80000e5023;0x800000408841e063;
2016-07-27T13:14:04.549Z cpu58:42053)Code start: 0x418018000000 VMK uptime: 2:04:41:19.840
2016-07-27T13:14:04.552Z cpu58:42053)base fs=0x0 gs=0x41804e800000 Kgs=0x0
That's the very same article that Iwan recently tipped me off to. Now we're getting somewhere!
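For anyone sifting through logs to see whether a crash resembles the KB's symptom, the sample line quoted above can be turned into a rough pattern match. This is only a sketch: the regular expression is modeled on that single sample line, the names `PSOD_PF14` and `match_kb_signature` are made up for illustration, and real PSOD output may be formatted differently. A match means "looks similar to the KB example", nothing more.

```python
import re
from typing import Optional

# Pattern modeled ONLY on the one sample line quoted from VMware KB 2146388.
# It is an assumption that other affected hosts log the same layout.
PSOD_PF14 = re.compile(
    r"cpu(?P<cpu>\d+):(?P<world_id>\d+)\)@BlueScreen: "
    r"#PF Exception 14 in world \d+:(?P<world_name>\S+) "
    r"IP (?P<ip>0x[0-9a-fA-F]+) addr (?P<addr>0x[0-9a-fA-F]+)"
)


def match_kb_signature(line: str) -> Optional[dict]:
    """Return the captured fields if the line resembles the KB sample, else None."""
    m = PSOD_PF14.search(line)
    return m.groupdict() if m else None


# The sample line from the KB excerpt, split across literals for readability.
sample = (
    "2016-07-27T13:14:04.549Z cpu58:42053)@BlueScreen: #PF Exception 14 "
    "in world 42053:vmm7:My_VM IP 0x410016bb8000 addr 0x410016bb8000 "
    "PTEs:0x10001c023;0x8000010023;0x80000e5023;0x800000408841e063;"
)

if __name__ == "__main__":
    print(match_kb_signature(sample))
```

The captured fields (CPU number, world name, IP and address) are the sort of details worth including when comparing notes with other affected owners.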
I've been in close touch with a very helpful Supermicro contact throughout this week, and they're currently working closely with Intel on a BIOS update fix for all their Xeon D-1500 systems. This means Xeon D-1518, Xeon D-1528, Xeon D-1541, Xeon D-1587, and soon, Xeon D-1567 too. Seems likely they're working on Xeon E as well, but I don't have confirmation of that.
When I continued to read that KB article, I felt some relief, then noticed this section:
This is a known hardware issue affecting supported versions of ESXi (ESXi 5.5 U3b and ESXi 6.0 U1b or later) with Intel Xeon Processor E5 v4, E7 v4, and D-1500 Families.
This is not a VMware issue. To resolve this issue, upgrade the system BIOS (firmware) to a version which provides the following microcode patch revision level for the associated Intel Xeon processors:
(see table in the KB)
Hmmm. This sure sounds more like a BIOS software workaround, rather than an actual fix. I realize CPU hardware ain't exactly easy to fix. But hey, if the new BIOS works, I guess we'll need to be OK with that. It's not like this sort of thing hasn't happened before.
If you are a SuperServer owner experiencing this PSOD issue and your need to work around this problem is urgent, you should contact 24-Hour SuperServer Technical Support at Supermicro, and mention VMware KB 2146388. Perhaps they can get you a beta BIOS code workaround. No promises, just a suggestion. See also my Disclaimer at bottom-left.
It could also be helpful if affected Xeon owners drop a comment below, letting others know the circumstances under which the PSOD surfaced in your system, and what BIOS level you're currently at. The more details, the better, for all involved.
I myself have not run into this PSOD issue, but then again, I'm not nesting 18 copies of ESXi 6.0 to test NSX either. Not saying that nesting is the reason, but it may be that it takes some stress to get this issue to surface. Don't know yet.
I've worked on 7 different Xeon D-1540/1541 systems myself these past 13 months. Just last week, I demonstrated all Xeon D CPUs that Supermicro currently uses at VMworld 2016, see also my comparison chart. I can safely say that I have never experienced a PSOD, even under extreme abuse, like running a 16 vCPU Win 10 VM assigned to 60GB or 64GB physical memory, running Prime95 to measure watts and decibels.
I've also had no reports of PSODs from any of the hundreds of happy Wiredzone Bundle owners either. Well, not until August 2016 that is. With thousands of Disqus comments left here below hundreds of TinkerTry articles to date, there had been NO reports of PSODs from any Xeon D-1500 owners before then.
Andreas' gut told him he should make introductions between me and Iwan, just in case my experience with Xeon D were to somehow help out here. I'm so glad he did. Seems Andreas was right, this nasty (but luckily rare?) PSOD can apparently affect both of these Broadwell-based Xeon families.
Of course, I'm taking this matter very seriously. I will continue to freely share the most up to date information about this issue as I can, just as I did for the Intel SR-IOV on Xeon D-1540 issue that I convinced Supermicro to clarify right on their SYS-5028D-TN4T/Xeon D-1541 product page, and as I did for the lack-of-10GbE-driver-for-ESXi issue that was resolved back in November of 2015.
It would appear all the right companies are taking this matter seriously, and it should only be a matter of weeks before a solution from Supermicro is published. Of course, you can count on hearing about the fix, and the BIOS upgrade/testing that my new friends out there are doing with NSX, right here at TinkerTry. Special thanks also go out to Rob Maas, the owner of a SYS-5028D-TN4T with the Xeon D-1540 CPU, for kindly pitching in a lot of time and effort on documenting the issue, and testing potential fixes, as we all work with Supermicro on this pressing matter.
- Recommended BIOS Settings for Supermicro SuperServer SYS-5028D-TN4T
Jan 15 2016
- How to deploy Windows Nano Server (TP5) on vSphere
Jul 15 2016 by Andreas Peetz at VMware Front Experience
- Intel Skylake bug causes PCs to freeze during complex workloads
Jan 11 2016 by Mark Walton at Ars Technica
- VCIX-NV Video Study Guide