Bug in some Intel Broadwell CPUs (Xeon D-1500 and Xeon E5/E7 v4) can result in VMware ESXi 5.5/6.0 PSODs, VMware KB 2146388 explains BIOS fix needed (FIX released Sep 14 2016)

Posted by Paul Braren on Sep 9 2016 (updated on Oct 21 2016) in
  • Virtualization
  • ESXi
  • CPU
  Oct 17 2016 Update - BIOS 1.1c has been released to address this issue, details below. I haven't yet confirmed with owners whether this resolves the intermittent/rare issue. Original article below.

    Jun 29 2016 - First Xeon E PSOD report

    Xeon E5 2650v4 owner's ESXi 6.0 Build 3620759 PSOD, June 29, 2016.

    That's the day I received my first email about these intermittent PSODs (Purple Screen Of Death) on a Supermicro SuperServer with two Xeon E5-2650v4 processors inside. This email referral came to me from Iwan Hoogendoorn (@i1wan), courtesy of Andreas Peetz (@VFrontDe) of VMware Front Experience. In case you didn't already know, Andreas is quite the VIB guru, and has undoubtedly helped vast numbers of homelabbers who need consumer-PC AHCI/SATA and Realtek drivers. You know, those drivers that went missing in VMware's 6.0 release, causing people's drives and NICs to disappear, until they found his simple fixes.

    TinkerTry's Paul Braren and VMware's Iwan Hoogendoorn at VMworld 2016 US Communities. This sort of in-person meetup is exactly why I value going to conferences and user groups.

    And Iwan, wow: an incredible list of advanced certifications, works for VMware, and a super nice and incredibly helpful guy. See also links to his amazing giving-back work below. As you can see, I even had the honor of meeting Iwan recently; see also my recent article, My experience at VMworld 2016 US.

    Unfortunately, Iwan and I were stumped, and no PSOD fix was found. Weeks went by, with a very determined Iwan continuing to reach out to his contacts...

    Jul 26 2016 - First Xeon D PSOD report

    Rob Maas, owner of a Supermicro SuperServer SYS-5028D-TN4T system, reported a similar PSOD. That certainly got even more of my attention, and it's the same system I own that I'm using daily. My gut was telling me that this problem was more likely an Intel issue, and not so much a Supermicro issue.

    Those gut impressions were just that, conjecture. We needed much more information before we could conclude much of anything about the problem, never mind figuring out a solution.

    Aug 19 2016 - VMware NSX

    Right through much of August, there was no acknowledgment to be found from Intel or VMware about this issue, anywhere. About all that the Xeon D and Xeon E stories had in common was that both systems were running VMware NSX. Since this appears to be a relatively rare and intermittent issue, it could take many days or even weeks to try to replicate. No way to force it to reliably fail-fast had been found yet either. He had even reached out to VMware support.

    The first reference linking to VMware's mention of Intel's PSOD-causing problem seems to be this article:

    where Allan refers us right to VMware KB 2146388 published on August 19 2016,

    • ESXi host fails with purple diagnostic screen when using Intel Xeon Processor E5 v4, E7 v4, and D-1500 Families (2146388)

    When running a supported version of ESXi (ESXi 5.5 U3b and ESXi 6.0 U1b or later) with Intel Xeon Processor E5 v4, E7 v4, and D-1500 Families you may experience these symptoms:

    • The ESXi host fails with this purple diagnostic screen (PSOD)
    • The backtrace information is missing in the screen
    • In the PSOD screen, you see error similar to:

    2016-07-27T13:14:04.549Z cpu58:42053)@BlueScreen: #PF Exception 14 in world 42053:vmm7:My_VM IP 0x410016bb8000 addr 0x410016bb8000 PTEs:0x10001c023;0x8000010023;0x80000e5023;0x800000408841e063;
    2016-07-27T13:14:04.549Z cpu58:42053)Code start: 0x418018000000 VMK uptime: 2:04:41:19.840
    2016-07-27T13:14:04.552Z cpu58:42053)base fs=0x0 gs=0x41804e800000 Kgs=0x0

    That's the very same article that Iwan recently tipped me off to. Now we're getting somewhere!
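
    For anyone sifting through saved PSOD text, here is a minimal Python sketch that checks whether a line resembles the sample quoted from the KB above. The field layout in the regular expression is inferred from that one sample line only, and the function name parse_psod_line is made up for illustration. A match does not prove you have hit this particular Intel issue; it only means the line has the same shape.

```python
import re

# Pattern inferred from the single sample line quoted from VMware KB 2146388.
# Assumption: other hosts format the "@BlueScreen" line the same way.
PSOD_RE = re.compile(
    r"@BlueScreen: #PF Exception 14 in world (?P<world>\d+):(?P<name>.+?)"
    r" IP (?P<ip>0x[0-9a-fA-F]+) addr (?P<addr>0x[0-9a-fA-F]+)"
)


def parse_psod_line(line):
    """Return the parsed fields if the line looks like the KB sample, else None."""
    match = PSOD_RE.search(line)
    if match is None:
        return None
    ip = int(match.group("ip"), 16)
    addr = int(match.group("addr"), 16)
    return {
        "world": int(match.group("world")),
        "name": match.group("name"),
        "ip": ip,
        "addr": addr,
        # In the KB sample, the faulting address equals the instruction pointer.
        "ip_equals_addr": ip == addr,
    }


sample = (
    "2016-07-27T13:14:04.549Z cpu58:42053)@BlueScreen: #PF Exception 14 "
    "in world 42053:vmm7:My_VM IP 0x410016bb8000 addr 0x410016bb8000 "
    "PTEs:0x10001c023;0x8000010023;0x80000e5023;0x800000408841e063;"
)

if __name__ == "__main__":
    print(parse_psod_line(sample))
```

    If your line matches, the BIOS level and the circumstances of the crash are still the details worth sharing with Supermicro or VMware support.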

    I've been in close touch with a very helpful Supermicro contact throughout this week, and they're currently working closely with Intel on a BIOS update fix for all their Xeon D-1500 systems. This means Xeon D-1518, Xeon D-1528, Xeon D-1541, Xeon D-1587, and soon, Xeon D-1567 too. Seems likely they're working on Xeon E as well, but I don't have confirmation of that.

    When I continued to read that KB article, I felt some relief, then noticed this section:

    This is a known hardware issue affecting supported versions of ESXi (ESXi 5.5 U3b and ESXi 6.0 U1b or later) with Intel Xeon Processor E5 v4, E7 v4, and D-1500 Families.

    This is not a VMware issue. To resolve this issue, upgrade the system BIOS (firmware) to a version which provides the following microcode patch revision level for the associated Intel Xeon processors:

    (see table in the KB)

    Hmmm. This sure sounds more like a BIOS software workaround, rather than an actual fix. I realize CPU hardware ain't exactly easy to fix. But hey, if the new BIOS works, I guess we'll need to be OK with that. It's not like this sort of thing hasn't happened before.

    If you experience this PSOD


    If you are a SuperServer owner experiencing this PSOD issue and your need to work around this problem is urgent, you should contact 24-Hour SuperServer Technical Support at Supermicro, and mention VMware KB 2146388. Perhaps they can get you a beta BIOS code workaround. No promises, just a suggestion. See also my Disclaimer at bottom-left.

    It could also be helpful if affected Xeon owners drop a comment below, letting others know the circumstances under which the PSOD surfaced in your system, and what BIOS level you're currently at. The more details, the better, for all involved.

    These were my first reports of any PSODs on Xeon D-1500

    I myself have not run into this PSOD issue, but then again, I'm not nesting 18 copies of ESXi 6.0 to test NSX either. Not saying that nesting is the reason, but it may be that it takes some stress to get this issue to surface. Don't know yet.

    I've worked on 7 different Xeon D-1540/1541 systems myself these past 13 months. Just last week, I demonstrated all Xeon D CPUs that Supermicro currently uses at VMworld 2016, see also my comparison chart. I can safely say that I have never experienced a PSOD, even under extreme abuse, like running a 16 vCPU Win 10 VM assigned to 60GB or 64GB physical memory, running Prime95 to measure watts and decibels.

    I've also had no reports of PSODs from any of the hundreds of happy Wiredzone Bundle owners either. Well, not until August 2016 that is. With thousands of Disqus comments left here below hundreds of TinkerTry articles to date, there have actually been NO reports of PSODs from any Xeon D-1500 owners.

    Andreas' gut told him he should make introductions between me and Iwan, just in case my experience with Xeon D were to somehow help out here. I'm so glad he did. Seems Andreas was right, this nasty (but luckily rare?) PSOD can apparently affect both of these Broadwell-based Xeon families.

    There is no supported workaround as we wait for a new BIOS

    Of course, I'm taking this matter very seriously. I will continue to freely share the most up to date information about this issue as I can, just as I did for the Intel SR-IOV on Xeon D-1540 issue that I convinced Supermicro to clarify right on their SYS-5028D-TN4T/Xeon D-1541 product page, and as I did for the lack-of-10GbE-driver-for-ESXi issue that was resolved back in November of 2015.

    Closing thoughts

    It would appear all the right companies are taking this matter seriously, and it should only be a matter of weeks before a solution from Supermicro is published. Of course, you can count on hearing about the fix, and the BIOS upgrade/testing that my new friends out there are doing with NSX, right here at TinkerTry. Special thanks also go out to Rob Maas, the owner of a SYS-5028D-TN4T with the Xeon D-1540 CPU, for kindly pitching in a lot of time and effort on documenting the issue, and testing potential fixes, as we all work with Supermicro on this pressing matter.

    Oct 17 2016

    Click on the image to start the download of BIOS 1.1c X10SDVF6_A03.zip straight from Supermicro, it contains the BIOS X10SDVF6.A03 file.

    BIOS 1.1c is now available for download.

    Alternative ways to get to the download site, and to read the EULA, are to start with either of the product pages:

    then click on the BIOS link found at any of those pages.

    I haven't yet confirmed with other owners who experienced PSODs whether or not this release resolves their intermittent/rare issues. That may take some weeks to determine with any certainty. I'm also trying to get my hands on the BIOS 1.1c release notes.

    If you prefer to not use Rufus to create the bootable USB key for the BIOS upgrade using the method detailed here:

    you can also use the Supermicro Update Manager or IPMI, as described here:

    Oct 19 2016 Update

    I received these second-hand today, unconfirmed and unedited:
    BIOS 1.1c Release Notes

    1. Update RC 2.3.0
    2. Update SPS to
    3. Update microcode M1050664_0F00000A.
    4. Prompt warning message to prevent user disable EHCI when XHCI in Auto/Smart Auto.
    5. Support F12 hotkey attempt to boot from onboard LANs orderly.
    6. Integrate Supermicro default key for secure boot.
    7. System hang at POST 0xA2 during PCH on/off stress test.
    8. Update ACM 1.3.0 PW and microcode M1050663_0700000C.
    9. Expose all supported bifurcation combination for PCIe slot 7.
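
    The two microcode names in those notes can be unpacked with a short Python sketch. It assumes the usual naming pattern for Intel microcode files: "M", two hex digits of platform ID, the CPUID signature in hex, an underscore, then eight hex digits of revision. That pattern is my assumption, not something the release notes state, and the function name decode_microcode_name is made up for illustration. The notes themselves are second-hand, so treat the decoded values the same way, and compare them against the table in VMware KB 2146388 rather than against anything printed here.

```python
import re

# Assumed naming convention (not stated in the release notes):
#   "M" + 2 hex digits platform ID + CPUID signature (hex) + "_" + 8 hex digits revision
NAME_RE = re.compile(r"^M([0-9A-Fa-f]{2})([0-9A-Fa-f]+)_([0-9A-Fa-f]{8})$")


def decode_microcode_name(name):
    """Split a microcode name into platform ID, CPUID signature fields, and revision."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized microcode name: {name!r}")
    platform_id = int(match.group(1), 16)
    signature = int(match.group(2), 16)
    revision = int(match.group(3), 16)

    # Field layout of CPUID leaf 1 EAX.
    stepping = signature & 0xF
    base_model = (signature >> 4) & 0xF
    base_family = (signature >> 8) & 0xF
    ext_model = (signature >> 16) & 0xF
    ext_family = (signature >> 20) & 0xFF

    # Extended fields only apply for certain base family values.
    family = base_family + ext_family if base_family == 0xF else base_family
    model = (ext_model << 4) | base_model if base_family in (0x6, 0xF) else base_model

    return {
        "platform_id": platform_id,
        "signature": signature,
        "family": family,
        "model": model,
        "stepping": stepping,
        "revision": revision,
    }


if __name__ == "__main__":
    for name in ("M1050664_0F00000A", "M1050663_0700000C"):
        info = decode_microcode_name(name)
        print(name, {key: hex(value) for key, value in info.items()})
```

    Read this way, the two names differ in stepping and in revision, which would fit a BIOS that carries microcode for more than one stepping of the same CPU family.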

    Oct 21 2016 Update

    Supermicro Xeon D SuperServer BIOS upgrade to 1.1c performed over IPMI Web UI

    Here's the Supermicro BIOS Update video I mention in the video.

    Next, the article that is credited with describing the technique shown in this video:

    1) The first choice for first Xeon D owners is to request an evaluation license that's tied to a single system, and only one per customer:

    2) If you want to keep using SUM, get a license for each system, for under $20 USD:

    I've been told these licenses are:

    • tied to just that system (married to the BMC that was provided)
    • perpetual (don't expire)
    • transferable (new owner of the hardware can continue to use the license)
