Temporary workaround to recover from intermittent Intel X552/X557 10GbE network outages on 12 and 16 core Xeon D, hoping for a public firmware update fix

Posted by Paul Braren on Dec 26 2017 (updated on Jan 16 2018) in
  • Network
  • HowTo
    If you really want to skip the interesting backstory, jump right down to the symptom, the workaround, and the proposed fix. This issue does not affect popular Xeon D systems with 8 or fewer cores.

    Backstory

    Yeah, it's a bit long, and somewhat complicated. At least I know it will feel good when this is all completely behind us!

    I'm glad that there are now hundreds of happy Supermicro SuperServer Bundle owners in the world who by-and-large have greatly enjoyed their product ownership experience, sharing a lot of positive experiences publicly, with fewer than 0.5% of Bundle buyers returning their SuperServer to Wiredzone for any reason. I'm also glad to have had TinkerTry readers play an active role in bringing the even-more-capable 12 core Xeon D-1567 to market in the form of another Wiredzone Bundle. It ships already burn-in-tested and fully warranted, along with the latest tested BIOS and IPMI firmware already installed. This has made for a much-improved out-of-box experience for those eager to get right to work, and once the fix is available for this 10GbE issue, you can bet Wiredzone will quickly add that fix to their standard procedures.

    When it comes to 10GbE networking, it turns out the 12 core Xeon D's track record has been a little bumpy. It's quite possible that only a small proportion of those owners actually use their two 10GbE ports for 10G or 1G connectivity. It is with those folks in mind that I write this article, those most enthusiastic Xeon D fans that paid the premium for extra cores and a bigger heatsink/fan, expecting to enjoy nearly linear scaling.

    TinkerTry-labeled-10GbE-ports-on-back-of-Xeon-D-SuperServer-Bundle

    Some of those 12 core owners have unfortunately been experiencing an intermittent problem with network outages, where the physical link-layer LEDs go dark at some random time, for no apparent reason. This unfortunate link-down state currently has only one known recovery method. Shut down whatever OS you're running, then remove power from the system. That's an unacceptable "fix," more like a workaround really. This sort of power cycling can't be done remotely, at least if you don't happen to have a smart power strip already installed between your UPS and your SuperServer.

    First report - April 2017

    Back in April of 2017, a solitary report of some 10GbE strangeness arrived, documented so well by Devoid at TinkerTry here:

    Months back, I experienced the random disconnects of the 10G NICs. I did the BIOS and ICMP updates along with ESXi 6 U2 and the 4.4.1 x552 driver. Things had been running without issue for months, so I thought all was well. I recently decided to turn up some of my old VMs that I've had powered off. All was well for about 3 days. Now, it appears my 10G disconnecting NIC issue is back. I even just updated to 4.5.1 x552 driver...no luck. The kicker, no matter how many reboots, how many interface shutdowns (on both the switch and esxcli), and even pulling the network cable, nothing would bring up the links. I even tried hard setting speeds, nothing. The only solution, pull power to the X10SDV-12C-TLN4F. The huge problem this is causing me is that all my VM storage is on my synology NAS, via 10g (x540-T2 also hooked up to same Cisco switch, no issues). Once my supermicro 10g Links go down, all the VMs die.

    So. My questions: is anyone else experiencing this? Given that it ran fine for months with a few VMs, but came crashing down when I loaded it up, I'm wondering if it's linked to load?

    My setup:
    X10SDV-12C-TLN4F
    BIOS: 1.1c
    Firmware: 03.46
    ESXi 6.0.0, 4600944 (u2)
    Cisco 3750x, C3KX-NM-10GT
    Synology RS3413xs+
    HELP! I wouldn't even know who to talk to, Cisco, Supermicro, VMWare, Intel?

    I replied with good intentions, then David replied back again with good news:

    Update to the story. I haven't been able to open a VMWare ticket yet. Hopefully I will be able to through work, if needed.

    The good news, however, is that I did open a ticket with SuperMicro, and they responded pretty quickly, and suggested I update the firmware. They provided a specific firmware update for the 10G NICs. I'd pass is along, but it seems pretty specific. The firmware was labeled: SDV23A.

    So far, so good. 6+ days and counting. It has a decent load, so I plan to pile on a few more VMs, and keep monitoring.

    Second report - July 2017

    TinkerTry-Xeon-D-SuperServer-Cluster-featuring-Micron-NVMe-and-SSD-Supermicro-VMware-VSAN-demo-for-VMworld-2016.JPG
    My Netgear XS708T was in my basement, but my server was on my second floor. 100' of CAT7 and some attic adventures solved that problem.

    A few months later, another report arrived. I'll admit I didn't think too much of this second report, since an RMA swap resolved his issue. Without a 10GbE switch hooked up to my own Xeon D-1567 SuperServer Workstation Bundle 1, I didn't have a way to replicate the rarely encountered issue either. But I never forgot about it, and this incident motivated me to wire my home up for 10GbE. So I climbed into my sweltering hot attic in August to get myself some fresh 100' CAT7 cabling strung from my basement's Netgear XS708T ProSAFE 8-Port 10-Gigabit Smart Managed Switch to my 2nd floor via the attic. Why? Well, I use my SuperServer Workstation near sleeping humans. While my Netgear XS708T was quieter than the Ubiquiti ES-16-XG switch I briefly tried (unboxing and testing), this switch was still far noisier than any of my Xeon D servers. That's why the switch stays in my basement, near my Xeon D-1541. At the time, I also hadn't heard about the Netgear XS708T's simple fan swap solution.

    The symptom

    Both of your 12 core Xeon D server's Intel X557 10GbE RJ45 ports lose link at some seemingly random interval. Both the yellow link LED and the green network speed LED go dark. The frequency of these outages ranges from several times per day to once every few months. It can happen with whatever OS you're running, and at seemingly random times, regardless of workload.

    Who might be affected

    Anybody with a 12 core Xeon D system using the X557 10GbE ports
    It's more complicated than that. To date, this issue seems to happen on:

    • Any OS
      I have reports of this problem occurring on:
      • VMware ESXi 6.0
      • VMware ESXi 6.5
      • XenServer 7.3
        That's not every OS, but I have no reason to believe the others are immune. I likely just haven't received any reports of it happening on Windows yet.
    • Any 10G switch
      I have reports of this problem occurring on:
    • Any 12 core or 16 core Xeon D
      Presumably any brand of 12 core or greater Xeon D system (there are many!), but I only first heard of this issue on Supermicro 12 core systems. This includes:
      • Xeon D-1557 featured on the X10SDV-12C-TLN4F motherboard as reported by Devoid here and discussed by phone recently
      • Xeon D-1567 featured on the X10SDV-12C+-WD002 motherboard (PIO-5028D-TN4T-01-WD002 in Windows Device Manager) that Wiredzone sells as part of the SYS-5028D-TN4T-12C SuperServer Bundle 1 and Bundle 2 system.
      • Xeon D-1577, Xeon D-1571, Xeon D-1559 likely affected too, see also the entire Intel Xeon Processor D Family (aka Broadwell DE) on Ark here.
    • Any network connection speed, 10GbE, and maybe 1GbE too
      • Presumably 1GbE links to the X557 network ports are also prone to this failure, but that's conjecture. Indications so far are that this is a firmware issue with the X557 itself.
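
    The rule of thumb in the list above can be written down as a small lookup. This is only a sketch: the core counts are the ones I understand Intel Ark to list for the models named in this article, and the function name and the `uses_x557` parameter are made up for illustration. It encodes "12 or more cores plus X557 ports in use" and nothing more, so it says nothing about root cause.

```python
# Core counts for the Xeon D models named in this article
# (assumption: taken from Intel Ark; verify against Ark before relying on them).
XEON_D_CORES = {
    "D-1540": 8,
    "D-1541": 8,
    "D-1557": 12,
    "D-1559": 12,
    "D-1567": 12,
    "D-1571": 16,
    "D-1577": 16,
    "D-1587": 16,
}


def likely_affected(model: str, uses_x557: bool = True) -> bool:
    """Return True if the model matches the article's rule of thumb:
    12 or more cores, with the X557 10GbE ports actually in use."""
    key = model.upper().replace("XEON", "").strip()
    if key not in XEON_D_CORES:
        raise ValueError(f"unknown Xeon D model: {model!r}")
    return uses_x557 and XEON_D_CORES[key] >= 12


if __name__ == "__main__":
    for name in ("Xeon D-1541", "Xeon D-1567", "Xeon D-1587"):
        print(name, likely_affected(name))
```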

    Failed workaround attempts

    I realize this is an odd section title, but when you read the bullet list, you'll start to gain a further understanding of why it has been challenging to get to the bottom of this issue.

    1. Power cycling the 10GbE network switch
    2. Upgrading firmware of the 10GbE switch
      I only tried this with my Netgear XS708T, it made no difference.
      I'm currently at 6.6.1.7, 1.0.0.8 level with 1.3.6.1.4.1.4526.100.4.39 System Object OID.
    3. Forcing different network negotiation methods in the device driver
      I only tried the tweaks that the Intel driver VIBs would allow, under VMware ESXi 6.5U1
    4. Trying different CAT6a or CAT7 cables
    5. Trying different cable lengths
    6. Correlating OS events with network outage events, no obvious pattern after exploring syslog from Netgear and attached VMware ESXi 6.5U1 host, with the configuration of VMware vRealize Log Insight detailed at TinkerTry here.
    7. Applying the Intel X557 firmware SDV23A using Intel's SDVTLN4.BAT batch file on DOS bootable media fixed the issue for Devoid for a few months, but it didn't work for me. My system had gone a few weeks without an outage, yet I encountered another one less than a day after the firmware upgrade. I'm really not sure what this means, there's just not enough data yet. It's also possible my firmware update didn't complete successfully.

    Why am I unsure about the upgrade? It starts with my article:

    • How to check network driver and NIC firmware details in VMware ESXi
      which I used to find the following information, right after the SDV23A upgrade on my Xeon D-1567:

    • Xeon D-1567 (TinkerTry home lab 12/26/2017):
      Driver Info:
           Bus Info: 0000:03:00.0
           Driver: ixgbe
           Firmware Version: 0x800005ad
           Version: 4.5.3-iov
    • Xeon D-1541 (TinkerTry home lab 12/26/2017):
      Driver Info:
           Bus Info: 0000:03:00.0
           Driver: ixgbe
           Firmware Version: 0x800003e7
           Version: 4.5.3-iov

    Yes, the firmware versions seem to differ. But do I know that SDV23A is supposed to give me 0x800005ad? Not entirely sure, and it doesn't show anywhere in my archive of all BIOS release notes.
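
    If you want to compare several hosts without eyeballing the output, here is a minimal sketch that pulls the fields out of a pasted "Driver Info:" block. It assumes the text looks like the two excerpts above (an indented key/value list under a "Driver Info:" heading); the function name is made up for illustration. It only tells you whether two firmware strings differ, not which one SDV23A is supposed to produce.

```python
def parse_driver_info(text: str) -> dict:
    """Collect the key/value lines indented under a 'Driver Info:' heading."""
    info = {}
    heading_indent = None
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        if line.strip() == "Driver Info:":
            heading_indent = indent
            continue
        if heading_indent is None:
            continue  # still above the Driver Info section
        if indent <= heading_indent:
            break  # left the Driver Info section
        # Split on the first colon only, so "0000:03:00.0" stays intact.
        key, sep, value = line.strip().partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


# The two excerpts from this article, pasted as-is.
XEON_D_1567 = """
Driver Info:
     Bus Info: 0000:03:00.0
     Driver: ixgbe
     Firmware Version: 0x800005ad
     Version: 4.5.3-iov
"""

XEON_D_1541 = """
Driver Info:
     Bus Info: 0000:03:00.0
     Driver: ixgbe
     Firmware Version: 0x800003e7
     Version: 4.5.3-iov
"""

if __name__ == "__main__":
    fw_1567 = parse_driver_info(XEON_D_1567)["Firmware Version"]
    fw_1541 = parse_driver_info(XEON_D_1541)["Firmware Version"]
    print("D-1567:", fw_1567, "D-1541:", fw_1541, "differ:", fw_1567 != fw_1541)
```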

    The workaround

    1. gracefully shut down your 12 core (or greater) system
    2. unplug the power cord for at least 15 seconds
    3. plug the power cord back in
    4. power up and boot your operating system up

    The workaround for VMware ESXi

    vSwitch1-Edit-Settings--TinkerTry

    This workaround won't prevent you from losing 10GbE connections, but it will allow an automatic failover to 1GbE for those occasions where powering down is very inconvenient.

    While you're likely using Intel I350 ETH0 for your service console, you can assign ETH1 to be your standby adapter.
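
    For those who prefer the command line over the vSwitch settings dialog, here is a hedged sketch that only builds the esxcli command for an active/standby uplink order and prints it for review. The vSwitch1 name comes from the screenshot above, but the vmnic names are assumptions: check which vmnic numbers your X557 and I350 ports actually have before running anything on a host.

```python
def failover_command(vswitch: str, active: list, standby: list) -> list:
    """Build (not run) the esxcli command that sets a standard vSwitch's
    active and standby uplinks."""
    return [
        "esxcli", "network", "vswitch", "standard", "policy", "failover", "set",
        f"--vswitch-name={vswitch}",
        f"--active-uplinks={','.join(active)}",
        f"--standby-uplinks={','.join(standby)}",
    ]


if __name__ == "__main__":
    # Assumed names: vmnic2 = an X557 10GbE port, vmnic1 = I350 ETH1.
    cmd = failover_command("vSwitch1", active=["vmnic2"], standby=["vmnic1"])
    print(" ".join(cmd))
```

    Printing the command rather than executing it keeps the sketch safe to run anywhere. Paste the result into an ssh session on the host only once the uplink names are confirmed.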

    The fix

    1. Contact Supermicro's 24-Hour SuperServer Technical Support directly.
    2. Inform the technician that you're opening a service request for your 12 core Xeon D system because of Intel X557 10GbE networking issues, asking that they provide you with the firmware fix.
    3. Supermicro might insist you sign an NDA before they can share the fix with you, I've been told.
    4. If you get "push back," ask the technician to refer to Supermicro Service Record # SM1704244248 that was reported to TinkerTry readers here.
    5. Supermicro then sends you an Intel utility to flash the two X557 ports on the motherboard. I'm not sure what firmware level it installs.
    6. For customers not willing to sign an NDA, Supermicro has offered to let them ship their system in so Supermicro can flash the X557 firmware for them. I don't have any confirmed stories of this actually happening though, with most folks electing to just sign the NDA.

    The better (future) fix

    A BIOS upgrade that also flashes both the Intel X557 (10G) and Intel I350 (1G) NICs, should it be confirmed that deploying those flash updates (without NDA) resolves these known issues.

    How failing fast might help everybody

    When troubleshooting such intermittent problems, it becomes important to find a way to make the problematic system fail fast. In other words, come up with an easy way to cause the problem without having to wait for it to happen naturally. Ideally, anybody could then replicate both the entire system configuration and the problem (network outage), on-demand, at-will. This would give Supermicro a way to more easily recreate the issue themselves, the first step in getting a proper solution that anybody can apply to their own system. This solution would most likely be in the form of a new BIOS version. Meanwhile, 1.2c is the latest BIOS currently available, see:

    As a blogger representing Supermicro owners like myself, I'm very reluctant to sign an NDA, much preferring to focus my energies on helping Supermicro find a solution that helps everybody anyway. For that to happen, these steps are likely needed first:

    1. A full recreate performed at Supermicro
    2. A fix is developed, presumably firmware
    3. Supermicro may need to coordinate with Intel for this
    4. QA testing of the fix done at Supermicro prior to GA release

    This all adds up to time. It will likely be weeks or even months before we have this fixed, and I'm very sorry about this temporary inconvenience. This article should help make that wait a little easier.

    Re-creation

    1. Install BIOS 1.2c and IPMI 3.58.
    2. Configure the BIOS exactly as shown here.
    3. Download free hypervisor ESXi 6.5.0a.
    4. Use iKVM to mount the ISO and install ESXi onto bootable USB media, such as on the readily available Sandisk.
    5. Once ESXi is configured, allow ssh (detailed in this article) then issue these two lines to download and install the latest Dec 04 2017 build 7388607 (helpful version history found here):
      esxcli software profile install -p ESXi-6.5.0-20171204001-standard -d https://hostupdate.vmware.com/software/VUM/PRODUCTION/main/vmw-depot-index.xml
      reboot
    6. As usual with all 6.x builds of ESXi, you’ll notice no 10G drivers are working or even visible: the built-in VMware inbox drivers don’t work with X557, so you need to install the 4.5.3 VIB from here then
      reboot
    7. You may find that you now have no 10G connection at the physical level (no link LEDs on, on your 10G switch).
    8. This 10G network outage can be resolved temporarily by shutting down and unplugging all system power for >15 seconds, then powering back up and booting ESXi back up, waiting for it to finish booting so the 10G driver loads and the 10G speed indicator comes back on.

    Reference

    The network adapter part #s are seen on page 3 of Intel's document:

    You'll find the various products that share the same device driver:

    Product Codes: EZX557-AT, EZX557-AT2, and EZX557-AT4

    which are also listed by the Device ID that's found at various places in the vSphere GUIs:

    Table 1-2 Device ID
    X557 Device                                          Vendor ID  Device ID
    Intel® Ethernet Connection X557-AT (Single 19x19mm)  8086       0xB4A3
    Intel® Ethernet Connection X557-AT2 (Dual 19x19mm)   8086       0xB4C3
    Intel® Ethernet Connection X557-AT4 (Quad 25x25mm)   8086       0xB4B3
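
    To match an ID you see in a GUI against this table, a small lookup is enough. This sketch copies the three rows from Table 1-2 as-is. The function name is made up for illustration, and it makes no claim about which ID a given vSphere screen will actually display.

```python
INTEL_VENDOR_ID = 0x8086

# Device IDs copied from Table 1-2 above.
X557_DEVICE_IDS = {
    0xB4A3: "Intel Ethernet Connection X557-AT",
    0xB4C3: "Intel Ethernet Connection X557-AT2",
    0xB4B3: "Intel Ethernet Connection X557-AT4",
}


def identify_x557(vendor_id: str, device_id: str):
    """Return the X557 product name for a hex vendor/device ID pair,
    or None if the pair is not in Table 1-2."""
    vendor = int(vendor_id, 16)  # accepts "8086" as well as "0x8086"
    device = int(device_id, 16)
    if vendor != INTEL_VENDOR_ID:
        return None
    return X557_DEVICE_IDS.get(device)


if __name__ == "__main__":
    print(identify_x557("8086", "0xB4C3"))
```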

    Closing thoughts for tonight

    I'm working with Supermicro support directly to help get a fix to you, my valued TinkerTry readers who invested heavily in >8 core Xeon D systems and demand stable 10GbE networking. It's now 10:22pm, and we just finished a long phone call together working through all the details of this article.

    The 12 core Xeon D folks who are using the latest X557 firmware that Supermicro provides are reporting that their 10G network drop issues have gone away, and I've spoken to one such individual myself.


    Dec 27 2017 Update

    Typos and grammar cleaned up. More conversations with Supermicro planned soon, with future progress updates to be posted right here.

    FYI, there may also be a small issue for some 12 core Xeon D owners using their Intel I350 1GbE ports under ESXi, with ETH0 and ETH1 assignments occasionally swapped after ESXi 6.0 to 6.5 upgrades. This seems to be easily remedied by installing the supported driver for ESXi, as explained and shown here. It is odd how these 2 network issues only affect 12 core (and maybe 16 core) Xeon D owners. One really has to wonder how this could be, given how similar they are when compared with the 8 core Xeon D-1541.

    Why an NDA for firmware?

    I suspect Intel is requiring Supermicro to not widely distribute the firmware fix, and/or the DOS tool EEUPDATE that implements the fix. Maybe it's just beta, maybe there are legal restrictions. I don't know for sure, this is just conjecture. It's important to note that for customers not willing to sign an NDA, Supermicro has offered to let them ship their system in so Supermicro can flash the X557 firmware for them. I don't have any confirmed stories of this actually happening though, with most folks electing to just sign the NDA.

    How does Xeon D-1541 differ from the Xeon D-1567?

    What I'm focused on here is things that differ that could affect the way the BIOS and IPMI are configured. The idea is to see if there's some good reason that the X557 would apparently begin to run into trouble only on systems with more cores.

    • Taller Heatsink - This assembly also includes a slightly bigger CPU fan.
      • This would seem to only cause Supermicro to slightly tweak the factory default fan speeds for the CPU fan header on the SoC/motherboard, which is very unlikely to change the temperatures of the physical X557 10G interfaces.
    • Slightly increased watt burn - Up to roughly 20% extra watts used versus 8 core models with the same clock speeds, and only when handling very heavy workloads.
      • Stress would seem to cause the system to come slightly closer to using about a third of the 250 watt power supply that the CSE-721TQ-250B comes with, the pieces that make up the SuperServer SYS-5028D-TN4T bare-bones systems.
      • It's hard to see how this would matter, especially since the X557 PHY should handle high ambient temps just fine, see the many ruggedized fanless designs. These network outages seem to happen during periods of inactivity/idle just as often as when the system is under load.
    • Slightly lower 2133MHz for up to 128GB of ECC DDR4 - Intel's architectural restrictions mean that only the Xeon D 8 core design allows the system to negotiate 2400MHz DDR4 speeds at POST, as confirmed in the BIOS and explained here and here.
      • The 12 core, and all other Xeon D models (4, 6, and 16 core) negotiate 2133MHz speeds, which is normal. You're unlikely to notice this speed difference during normal use, perhaps up to a 5% difference that's likely only revealed with synthetic benchmarks.
        X10SDV-12C-TLN4F-cropped--TinkerTry
        Click to visit the Supermicro X10SDV-12C-TLN4F motherboard product page.
      • An increased number of cores is a huge advantage for multi-threaded workloads. For pricing/marketplace reasons, Wiredzone cut over to 2400MHz for all SuperServer Bundles a long time ago, in the summer of 2016. I have no regrets in buying my 12 core system, I actively use mine hundreds of hours per month creating nearly all the content and videos here at TinkerTry.
      • It doesn't matter if you have 2 (included with Bundles) or 4 memory sticks installed, 2133MHz is the max you'll get if you have anything other than 8 cores in your Xeon D.
      • This Intel design restriction seems to be confirmed to be an industry-wide thing, seen on the various specs sheets here.
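
    For a sense of scale on the memory speed point, here is a back-of-envelope sketch of theoretical peak bandwidth. It assumes the usual DDR4 figures of 8 bytes per transfer per channel and the two memory channels of this Xeon D platform. The function name is made up for illustration. It's an upper bound on paper only, and real workloads see far less of a gap, which fits the small difference described above.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: float,
                        channels: int = 2,
                        bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes)."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


if __name__ == "__main__":
    slow = peak_bandwidth_gb_s(2133)  # 12 core and other non-8-core models
    fast = peak_bandwidth_gb_s(2400)  # 8 core models
    gap_percent = (fast - slow) / fast * 100
    print(f"2133: {slow:.1f} GB/s, 2400: {fast:.1f} GB/s, gap: {gap_percent:.1f}%")
```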

    I'm an optimist about this sort of thing, even issues that drag on as long as this one has. With so many companies making Intel Xeon D systems out there, and many designs enjoying at least 7 years of product life and support, Intel and Supermicro are highly motivated to resolve this issue. Intel has historically had very robust firmware, and their track record of many VMware/Intel X557 driver/VIB releases these past 2+ years demonstrates how active the Xeon D market continues to be.


    Jan 16 2018 Update

    TinkerTry is All-in!

    After weeks of failed attempts to recreate the X557 issue at Supermicro HQ in San Jose CA, I'm shipping them my very own Xeon D-1567 SYS-5028D-TN4T. It just so happens to likely be the very first such unit ever produced, but that's likely not relevant, as it appears 200 identical 12 core Xeon D-1567 motherboards were made in the same production run for Wiredzone. Since my system encounters the X557 issue within a day or two regardless of the workload, recreate at Supermicro shouldn't take long, and I'm even mailing them my Netgear XS708T switch too, just in case.

    How long it takes Supermicro and Intel to develop a proper fix is another matter, but I'm doing all I can to accelerate that process. This isn't a simple matter.

    Since my 12 core is my primary workstation & datacenter, to make this loan to Supermicro possible, TinkerTry.com, LLC has now invested in a 2nd identical Supermicro SuperServer Bundle 2 12 core. Once I have my drives and primary Windows 10 VM moved over, and a recreate accomplished on the exact configuration I intend to ship, I'll be able to send the affected system off to California. Special thanks to Wiredzone for helping with accelerated cross-country shipping, and to all my advertisers for making TinkerTry's 3rd Xeon D purchase possible. I stand behind anything I put my name, and reputation, behind.

    This new addition to my family will be a huge boon for my testing and staging and content creation, with one Xeon D-1541 and one Xeon D-1567 soon available to test and reboot, at will. Well, at least once my primary workstation returns, of course. Hopefully soon.

    I've also collected several Supermicro SM#s for them to investigate, from various TinkerTry readers who have been pitching in.

    Xeon D-1540 (8 core) and Xeon D-1587 (16 core) too?

    Note that the first-generation Xeon D was called the Xeon D-1540, and there is now one report of the same X557 network-down issue by verdragan in this article's comments. That is an interesting twist, perhaps this will give Supermicro and/or Intel some insight into root cause.

    You'll also notice we now have a report from Xeon D-1587 owner takaze, with the article above updated accordingly.

    Finally, there's an owner of 6 SuperServer Bundles out there: four 8 core systems and two 12 core systems. He's only experienced these X557 issues on the 12 core systems. It breaks my heart to ask him to avoid 10G for now as a workaround, but I'm confident we'll get this resolved, without him having to sign an NDA or ship his systems to Supermicro for the update.


    Jan 19 2018 Update

    Supermicro has been in communications with me about this, and I might not need to ship them my system after all, with many folks now involved. I don't have any significant new developments to share at this time, unfortunately.

    I've also reached out to an Intel spokesperson for assistance.


    See also at TinkerTry

    promise-sanlink3-t1-adapter-gives-thunderbolt-3-usb-pc-10gbe

    xeon-d-landscape-2017

    ubiquiti-mpower-pro-8-port-outlet-measures-watts

    a-good-look-at-the-worlds-first-16-core-supermicro-superserver-xeon-d-1587-thanks-canada

    how-to-install-intel-x552-vib-on-esxi-6-on-superserver-5028d-tn4t