Workaround and fix for intermittent Intel X552/X557 10GbE/1GbE network outages on Xeon D-1521/1540 and pre-2018 1567/1587; popular E300-8D/E200-8D 1518/1528 and 1541 never had the issue

Posted by Paul Braren on Dec 26 2017 (updated on Jul 25 2018) in
  • Network
  • HowTo

    If you really want to skip the interesting backstory, jump right down to the symptom, the workaround, the proposed fix, and the actual fix done at Supermicro. This issue never affected the popular Xeon D-1541 systems with 8 cores.

    Backstory

    Yeah, it's a bit long, and somewhat complicated. At least I know it will feel good when this is all completely behind us!

    I'm glad that there are now hundreds of happy Supermicro SuperServer Bundle owners in the world who by-and-large have greatly enjoyed their product ownership experience, sharing a lot of positive experiences publicly, with fewer than 0.5% of Bundle buyers returning their SuperServer to Wiredzone for any reason. I'm also glad to have had TinkerTry readers play an active role in bringing the even-more-capable 12 core Xeon D-1567 to market in the form of another Wiredzone Bundle. It ships already burn-in-tested and fully warranted, along with the latest tested BIOS and IPMI firmware already installed. This has made for a much-improved out-of-box experience for those eager to get right to work, and once the fix is available for this 10GbE issue, you can bet Wiredzone will quickly add that fix to their standard procedures.

    When it comes to 10GbE networking, it turns out the 12 core Xeon D's track record has been a little bumpy. It's quite possible that only a small proportion of those owners actually use their two 10GbE ports for 10G or 1G connectivity. It is with those folks in mind that I write this article, those most enthusiastic Xeon D fans that paid the premium for extra cores and a bigger heatsink/fan, expecting to enjoy nearly linear scaling.

    TinkerTry-labeled-10GbE-ports-on-back-of-Xeon-D-SuperServer-Bundle

    Some of those 12 core owners have unfortunately been experiencing an intermittent problem with network outages, where the physical link-layer LEDs go dark at some random time, for no apparent reason. This link-down state currently has only one known recovery method: shut down whatever OS you're running, then remove power from the system. That's an unacceptable "fix," more like a workaround really. This sort of power cycling can't be done remotely, unless you happen to have a smart power strip already installed between your UPS and your SuperServer.

    First report - April 2017

    Back in April of 2017, a solitary report of some 10GbE strangeness arrived, documented so well by Devoid at TinkerTry here:

    Months back, I experienced the random disconnects of the 10G NICs. I did the BIOS and ICMP updates along with ESXi 6 U2 and the 4.4.1 x552 driver. Things had been running without issue for months, so I thought all was well. I recently decided to turn up some of my old VMs that I've had powered off. All was well for about 3 days. Now, it appears my 10G disconnecting NIC issue is back. I even just updated to 4.5.1 x552 driver...no luck. The kicker, no matter how many reboots, how many interface shutdowns (on both the switch and esxcli), and even pulling the network cable, nothing would bring up the links. I even tried hard setting speeds, nothing. The only solution, pull power to the X10SDV-12C-TLN4F. The huge problem this is causing me is that all my VM storage is on my synology NAS, via 10g (x540-T2 also hooked up to same Cisco switch, no issues). Once my supermicro 10g Links go down, all the VMs die.

    So. My questions: is anyone else experiencing this? Given that it ran fine for months with a few VMs, but came crashing down when I loaded it up, I'm wondering if it's linked to load?

    My setup:
    X10SDV-12C-TLN4F
    BIOS: 1.1c
    Firmware: 03.46
    ESXi 6.0.0, 4600944 (u2)
    Cisco 3750x, C3KX-NM-10GT
    Synology RS3413xs+
    HELP! I wouldn't even know who to talk to, Cisco, Supermicro, VMWare, Intel?

    I replied with good intentions, then David replied back again with good news:

    Update to the story. I haven't been able to open a VMWare ticket yet. Hopefully I will be able to through work, if needed.

    The good news, however, is that I did open a ticket with SuperMicro, and they responded pretty quickly, and suggested I update the firmware. They provided a specific firmware update for the 10G NICs. I'd pass is along, but it seems pretty specific. The firmware was labeled: SDV23A.

    So far, so good. 6+ days and counting. It has a decent load, so I plan to pile on a few more VMs, and keep monitoring.

    Second report - July 2017

    My Netgear XS708T was in my basement, but my server was on my second floor. 100' of CAT7 and some attic adventures solved that problem.

    A few months later, another report. I'll admit I didn't think too much of this second report, since an RMA swap resolved his issue. Without a 10GbE switch hooked up to my own Xeon D-1567 SuperServer Workstation Bundle 1, I didn't have a way to replicate the rarely encountered issue either. But I never forgot about it, and this incident motivated me to wire my home up for 10GbE. So I climbed into my sweltering hot attic in August to get myself some fresh 100' CAT7 cabling strung from my basement's Netgear XS708T ProSAFE 8-Port 10-Gigabit Smart Managed Switch to my 2nd floor via the attic. Why? Well, I use my SuperServer Workstation near sleeping humans. While my Netgear XS708T was quieter than the Ubiquiti ES-16-XG switch I briefly tried (unboxing and testing), it was still far noisier than any of my Xeon D servers. That's why the switch stays in my basement, near my Xeon D-1541. At the time, I also hadn't heard about the Netgear XS708T's simple fan swap solution.

    The symptom

    Both of your 12 core Xeon D server's Intel X557 10GbE RJ45 ports lose link at some seemingly random interval. All LEDs go dark: the yellow link LED and the green network speed LED. The frequency of these outages ranges from several times per day to once every few months. It can happen with whatever OS you're running, at seemingly random times, regardless of workload.

    Who might be affected

    Anybody with a 12 core Xeon D system using the X557 10GbE ports
    It's more complicated than that. To date, this issue seems to happen on:

    • Any OS
      I have reports of this problem occurring on:
      • VMware ESXi 6.0
      • VMware ESXi 6.5
      • XenServer 7.3
        That's not every OS, but I have no reason to believe the others are immune. I likely just haven't received any reports of it happening on Windows yet.
    • Any 10G switch
      I have reports of this problem occurring on:
    • Any 12 core or 16 core Xeon D
      Presumably any brand of Xeon D system with 12 or more cores (there are many!), though I first heard of this issue on Supermicro 12 core systems. This includes:
      • Xeon D-1557 featured on the X10SDV-12C-TLN4F motherboard as reported by Devoid here and discussed by phone recently
      • Xeon D-1567 featured on the X10SDV-12C+-WD002 motherboard (PIO-5028D-TN4T-01-WD002 in Windows Device Manager) that Wiredzone sells as part of the SYS-5028D-TN4T-12C SuperServer Bundle 1 and Bundle 2 system.
      • Xeon D-1577, Xeon D-1571, Xeon D-1559 likely affected too, see also the entire Intel Xeon Processor D Family (aka Broadwell DE) on Ark here.
    • Any network connection speed, 10GbE, and maybe 1GbE too
      • Presumably 1GbE links to the X557 network ports are also prone to this failure, but that's conjecture. Indications so far are that this appears to be a firmware issue with the X557 itself.

    Failed workaround attempts

    I realize this is an odd section title, but when you read the list, you'll start to gain a further understanding of why it has been challenging to get to the bottom of this issue.

    1. Power cycling the 10GbE network switch
    2. Upgrading firmware of the 10GbE switch
      I only tried this with my Netgear XS708T, it made no difference.
      I'm currently at 6.6.1.7, 1.0.0.8 level with 1.3.6.1.4.1.4526.100.4.39 System Object OID.
    3. Forcing different network negotiation methods in the device driver
      I only tried the tweaks that the Intel driver VIBs would allow, under VMware ESXi 6.5U1.
    4. Trying different CAT6a or CAT7 cables
    5. Trying different cable lengths
    6. Correlating OS events with network outage events, no obvious pattern after exploring syslog from Netgear and attached VMware ESXi 6.5U1 host, with the configuration of VMware vRealize Log Insight detailed at TinkerTry here.
    7. Applying the Intel X557 firmware SDV23A using Intel's SDVTLN4.BAT batch file on DOS bootable media fixed the issue for Devoid for a few months, but it didn't work for me. I encountered another outage less than a day after the firmware upgrade, following a few weeks of uptime. I'm really not sure what this means, there's just not enough data yet. It's also possible my firmware update didn't complete successfully.

    Why am I unsure about the upgrade? It starts with my article:

    • How to check network driver and NIC firmware details in VMware ESXi
      which I used to find the following information right after the SDV23A upgrade on my Xeon D-1567:

    • Xeon D-1567 (TinkerTry home lab 12/26/2017):
      Driver Info:
           Bus Info: 0000:03:00.0
           Driver: ixgbe
           Firmware Version: 0x800005ad
           Version: 4.5.3-iov
    • Xeon D-1541 (TinkerTry home lab 12/26/2017):
      Driver Info:
           Bus Info: 0000:03:00.0
           Driver: ixgbe
           Firmware Version: 0x800003e7
           Version: 4.5.3-iov

    Yes, the firmware versions seem to differ. But do I know that SDV23A is supposed to give me 0x800005ad? Not entirely sure, and it doesn't show anywhere in my archive of all BIOS release notes.
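
To compare firmware levels across hosts without eyeballing the full output, the Firmware Version line can be pulled out with a small shell helper. This is only a minimal sketch: the function name nic_firmware is hypothetical, and it assumes the "Driver Info" layout printed by esxcli network nic get, as shown above.

```shell
#!/bin/sh
# Hypothetical helper: print only the value of the "Firmware Version:" line
# from `esxcli network nic get -n <vmnic>` output supplied on stdin.
nic_firmware() {
    sed -n 's/^[[:space:]]*Firmware Version:[[:space:]]*//p'
}

# Example use on an ESXi host (vmnic3 is the NIC name used in this article;
# yours may differ):
#   esxcli network nic get -n vmnic3 | nic_firmware
```

Running it on each host gives one short string per NIC to compare. It doesn't answer which hex value a given SDV23x package is supposed to produce, since that mapping isn't documented anywhere I've seen.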

    The workaround

    1. gracefully shut down your 12 core (or greater) system
    2. unplug the power cord for at least 15 seconds
    3. plug the power cord back in
    4. power up and boot your operating system up

    The workaround for VMware ESXi

    vSwitch1-Edit-Settings--TinkerTry

    This workaround won't prevent you from losing 10GbE connections, but it will allow an automatic fall-back to 1GbE for those occasions where powering down is very inconvenient.

    While you're likely using Intel I350 ETH0 for your service console, you can assign ETH1 to be your standby adapter.
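
The same active/standby arrangement can also be set from the ESXi shell instead of the vSwitch settings dialog. The sketch below is an illustration under stated assumptions, not a tested recipe. The vSwitch name vSwitch1 comes from the screenshot above. The uplink names vmnic2 (an X557 10GbE port) and vmnic1 (the I350 ETH1 port) are hypothetical, so check esxcli network nic list for your own mapping. It also assumes both uplinks are already attached to the vSwitch. By default it only prints the commands, and you need to set APPLY=1 to actually run them.

```shell
#!/bin/sh
# Assumed names: adjust to match your own host before applying.
VSWITCH="${VSWITCH:-vSwitch1}"   # vSwitch name as seen in the screenshot
ACTIVE="${ACTIVE:-vmnic2}"       # hypothetical: an X557 10GbE uplink
STANDBY="${STANDBY:-vmnic1}"     # hypothetical: the I350 1GbE uplink (ETH1)

# Dry run by default: print each command instead of executing it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

configure_failover() {
    # Make the 10GbE uplink active and the 1GbE uplink standby.
    run esxcli network vswitch standard policy failover set -v "$VSWITCH" -a "$ACTIVE" -s "$STANDBY"
    # Show the resulting teaming policy for verification.
    run esxcli network vswitch standard policy failover get -v "$VSWITCH"
}

configure_failover
```

With the standby uplink in place, a dark X557 port should cause traffic to move to the 1GbE port. That's slower, but it keeps VMs reachable until you can schedule the power-off.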

    The fix

    1. Contact Supermicro's 24-Hour SuperServer Technical Support directly.
    2. Inform the technician that you're opening a service request for your 12 core Xeon D system because of Intel X557 10GbE networking issues, asking that they provide you with the firmware fix.
    3. Supermicro might insist you sign an NDA before they can share the fix with you, I've been told.
    4. If you get "push back," ask the technician to refer to Supermicro Service Record # SM1704244248 that was reported to TinkerTry readers here.
    5. Supermicro then sends you an Intel utility to flash the two X557 ports on the motherboard. I'm not sure what firmware level it installs.
    6. For customers not willing to sign an NDA, Supermicro has offered to have them ship their system in, and Supermicro will flash the X557 firmware for them. I don't have any confirmed stories of this actually happening though, with most folks electing to just sign the NDA.

    The better (future) fix

    A BIOS upgrade that also flashes both the Intel X557 (10G) and Intel I350 (1G) NICs, should it be confirmed that deploying those flash updates (without NDA) resolves these known issues.

    How failing fast might help everybody

    When troubleshooting such intermittent problems, it becomes important to find a way to make the problematic system fail fast. In other words, come up with an easy way to cause the problem without having to wait for it to happen naturally. Ideally, anybody could then replicate both the entire system configuration and the problem (network outage), on-demand, at-will. This would give Supermicro an easier way to recreate the issue themselves, the first step in getting a proper solution that anybody can apply to their own system. This solution would most likely be in the form of a new BIOS version. Meanwhile, 1.2c is the latest BIOS currently available, see:

    As a blogger representing Supermicro owners like myself, I'm very reluctant to sign an NDA, much preferring to focus my energies on helping Supermicro find a solution that helps everybody anyway. For that to happen, these steps are likely needed first:

    1. A full recreate performed at Supermicro
    2. A fix is developed, presumably firmware
    3. Supermicro may need to coordinate with Intel for this
    4. QA testing of the fix done at Supermicro prior to GA release

    This all adds up to time. It will likely be weeks or even months before we have this fixed, and I'm very sorry about this temporary inconvenience. This article should help make that wait a little easier.

    Re-creation

    1. Install BIOS 1.2c and IPMI 3.58.
    2. Configure the BIOS exactly as shown here.
    3. Download free hypervisor ESXi 6.5.0a.
    4. Use iKVM to mount the ISO and install ESXi onto bootable USB media, such as on the readily available Sandisk.
    5. Once ESXi is configured, allow ssh (detailed in this article) then issue these two lines to download and install the latest Dec 04 2017 build 7388607 (helpful version history found here):
      esxcli software profile install -p ESXi-6.5.0-20171204001-standard -d https://hostupdate.vmware.com/software/VUM/PRODUCTION/main/vmw-depot-index.xml
      reboot
    6. As usual with all 6.x builds of ESXi, you’ll notice no 10G drivers are working or even visible: the built-in VMware inbox drivers don’t work with X557, so you need to install the 4.5.3 VIB from here then
      reboot
    7. You may find that you now have no 10G connection at the physical level (no link LEDs on, on your 10G switch).
    8. This 10G network outage can be resolved temporarily by shutting down and unplugging all system power for >15 seconds, then powering back up and booting ESXi back up, waiting for it to finish booting so the 10G driver loads and the 10G speed indicator comes back on.
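
When running a recreate like the one above, it helps to detect the link-down state from the ESXi shell and not only by watching LEDs. Here's a minimal sketch that filters esxcli network nic list output for ixgbe ports whose link is down. The function name down_ixgbe_nics is hypothetical, and the column positions (driver in column 3, link status in column 5) are an assumption based on the ESXi 6.x table layout, so verify them against your own output first.

```shell
#!/bin/sh
# Hypothetical helper: read `esxcli network nic list` output on stdin and
# print the names of NICs that use the ixgbe driver and report link Down.
# Assumes whitespace-separated columns: Name, PCI Device, Driver,
# Admin Status, Link Status, ...
down_ixgbe_nics() {
    awk '$3 == "ixgbe" && $5 == "Down" { print $1 }'
}

# Example use on an ESXi host, checking once a minute and logging with a
# timestamp whenever an X557 port is down:
#   while true; do
#       nics=$(esxcli network nic list | down_ixgbe_nics)
#       [ -n "$nics" ] && echo "$(date) link down: $nics"
#       sleep 60
#   done
```

A timestamped record of when the link drops makes it much easier to line the outage up against syslog entries, and to tell Supermicro how long the system ran before failing.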

    Reference

    The network adapter part #s are seen on page 3 of Intel's document:

    You'll find the various products that share the same device driver:

    Product Codes: EZX557-AT, EZX557-AT2, and EZX557-AT4

    which are also listed by the Device ID that's found at various places in the vSphere GUIs:

    Table 1-2 Device ID
    X557 Device                                            Vendor ID   Device ID
    Intel® Ethernet Connection X557-AT (Single 19x19mm)    8086        0xB4A3
    Intel® Ethernet Connection X557-AT2 (Dual 19x19mm)     8086        0xB4C3
    Intel® Ethernet Connection X557-AT4 (Quad 25x25mm)     8086        0xB4B3

    Closing thoughts for tonight

    I'm working with Supermicro support directly to help get a fix to you, my valued TinkerTry readers who invested heavily in Xeon D systems with more than 8 cores and who demand stable 10GbE networking. It's now 10:22pm, and we just finished a long phone call together working through all the details of this article.

    The 12 core Xeon D folks who are using the latest X557 firmware that Supermicro provides are reporting that their 10G network drop issues have gone away, and I've spoken to one such individual myself.


    Dec 27 2017 Update

    Typos and grammar cleaned up. More conversations with Supermicro planned soon, with future progress updates to be posted right here.

    FYI, there may also be a small issue for some 12 core Xeon D owners using their Intel I350 1GbE ports under ESXi, with ETH0 and ETH1 assignments occasionally swapped after ESXi 6.0 to 6.5 upgrades. This seems to be easily remedied by installing the supported driver for ESXi, as explained and shown here. It is odd how these 2 network issues only affect 12 core (and maybe 16 core) Xeon D owners. One really has to wonder how this could be, given how similar they are when compared with the 8 core Xeon D-1541.

    Why is there an NDA for firmware?

    I suspect Intel is requiring Supermicro to not widely distribute the firmware fix, and/or the DOS tool EEUPDATE that implements the fix. Maybe it's just beta, maybe there are legal restrictions. I don't know for sure, this is just conjecture. It's important to note that for customers not willing to sign an NDA, Supermicro has offered to have them ship their system in, and Supermicro will flash the X557 firmware for them. I don't have any confirmed stories of this actually happening though, with most folks electing to just sign the NDA.

    How does Xeon D-1541 differ from the Xeon D-1567?

    What I'm focused on here is things that differ that could affect the way the BIOS and IPMI are configured. The idea is to see if there's some good reason that the X557 would apparently begin to run into trouble only on systems with more cores.

    • Taller Heatsink - This assembly also includes a slightly bigger CPU fan.
      • This would seem to only cause Supermicro to slightly tweak the factory default fan speeds for the CPU fan header on the SoC/motherboard, which is very unlikely to change the temperatures of the physical X557 10G interfaces.
    • Slightly increased watt burn - Up to roughly 20% extra watts used versus 8 core models with the same clock speeds, and only when handling very heavy workloads.
      • Under stress, the system would seem to come slightly closer to using about a third of the 250 watt power supply that comes with the CSE-721TQ-250B chassis, one of the pieces that make up the SuperServer SYS-5028D-TN4T bare-bones systems.
      • It's hard to see how this would matter, especially since the X557 PHY should handle high ambient temps just fine, see the many ruggedized fanless designs. These network outages seem to happen during periods of inactivity/idle just as often as when the system is under load.
    • Slightly lower 2133MHz for up to 128GB of ECC DDR4 - Intel's architectural restrictions mean that only the Xeon D 8 core design allows the system to negotiate 2400MHz DDR4 speeds at POST, as confirmed in the BIOS and explained here and here.
      • The 12 core, and all other Xeon D models (4, 6, and 16 core) negotiate 2133MHz speeds, which is normal. You're unlikely to notice this speed difference during normal use, perhaps up to a 5% difference that's likely only revealed with synthetic benchmarks.
      • An increased number of cores is a huge advantage for multi-threaded workloads. For pricing/marketplace reasons, Wiredzone cut over to 2400MHz memory for all SuperServer Bundles a long time ago, in the summer of 2016. I have no regrets in buying my 12 core system. I actively use mine hundreds of hours per month creating nearly all the content and videos here at TinkerTry.
      • It doesn't matter if you have 2 (included with Bundles) or 4 memory sticks installed, 2133MHz is the max you'll get if you have anything other than 8 cores in your Xeon D.
      • This Intel design restriction appears to be industry-wide, as seen on the various spec sheets here.

    I'm an optimist about this sort of thing, even issues that drag on as long as this one has. With so many companies making Intel Xeon D systems out there, and many designs enjoying at least 7 years of product life and support, Intel and Supermicro are highly motivated to resolve this issue. Intel has historically had very robust firmware, and their track record of many VMware/Intel X557 driver/VIB releases these past 2+ years demonstrates how active the Xeon D market continues to be.


    Jan 16 2018 Update

    TinkerTry is All-in!

    After weeks of failed attempts to recreate the X557 issue at Supermicro HQ in San Jose CA, I'm shipping them my very own Xeon D-1567 SYS-5028D-TN4T. It just so happens to likely be the very first such unit ever produced, but that's probably not relevant, as it appears 200 identical 12 core Xeon D-1567 motherboards were made in the same production run for Wiredzone. Since my system encounters the X557 issue within a day or two regardless of the workload, the recreate at Supermicro shouldn't take long, and I'm even mailing them my Netgear XS708T switch too, just in case.

    How long it takes Supermicro and Intel to develop a proper fix is another matter, but I'm doing all I can to accelerate that process. This isn't a simple matter.

    Since my 12 core is my primary workstation & datacenter, to make this loan to Supermicro possible, TinkerTry.com, LLC has now invested in a 2nd identical Supermicro SuperServer Bundle 2 12 core. Once I have my drives and primary Windows 10 VM moved over, and a recreate accomplished on the exact configuration I intend to ship, I'll be able to send the affected system off to California. Special thanks to Wiredzone for helping with accelerated cross-country shipping, and to all my advertisers for making TinkerTry's 3rd Xeon D purchase possible. I stand behind anything I put my name, and reputation, behind.

    This new addition to my family will be a huge boon for my testing and staging and content creation, with one Xeon D-1541 and one Xeon D-1567 soon available to test and reboot, at will. Well, at least once my primary workstation returns, of course. Hopefully soon.

    I've also collected several Supermicro SM#s for them to investigate, from various TinkerTry readers who have been pitching in.

    Xeon D-1540 (8 core) and Xeon D-1587 (16 core) too?

    Note that the first-generation Xeon D was called the Xeon D-1540, and there is now one report of the same X557 network-down issue by verdragan in this article's comments. That is an interesting twist. Perhaps this will give Supermicro and/or Intel some insight into root cause.

    You'll also notice we now have a report from Xeon D-1587 owner takaze, with the article above updated accordingly.

    Finally, there's an owner of 6 SuperServer Bundles out there: four 8 core systems and two 12 core systems. He's only experienced these X557 issues on the 12 core systems. It breaks my heart to ask him to avoid 10G for now as a workaround, but I'm confident we'll get this resolved, without him having to sign an NDA or ship his system to Supermicro to update it for him.


    Jan 19 2018 Update

    Supermicro has been in communications with me about this, and I might not need to ship them my system after all, with many folks now involved. I don't have any significant new developments to share at this time, unfortunately.

    I've also reached out to an Intel spokesperson for assistance.


    Feb 02 2018 Update

    Preparing my system for shipment to Supermicro for recreate has proven to be much more challenging than anticipated. The problem is no longer happening on a daily basis, due to factors I haven't yet figured out. I will keep posting details on my tests right here.


    Mar 03 2018 Update

    During 3 weeks of heavy 10GbE testing on another Xeon D-1567 SuperServer (Bundle 2), the issue did not surface. Admittedly, I don't really know why, but using syslogging to my vRealize Log Insight, I'm able to confirm that I've had zero outages on that system, even though it was literally running the same OS, which is still ESXi 6.5U1 on USB, moved over to the temporary, new system.

    In a way this is good, and it could explain why Wiredzone has had less than a handful of reports of this problem, with most of those coming from comments on this article. I still don't know why this issue happens on only some 12 core Xeon D systems, and not others.

    I managed to convince Supermicro to perform my X557 firmware flash remotely, as a pilot effort of sorts. This saved on shipping costs, and side-stepped the need to sign any NDAs.

    First, I carefully and temporarily exposed the IPMI interface of my problematic Xeon D-1567 to Supermicro support on a new, public IP address. They were then able to access my system's IPMI (only) over https, using a very long, non-default password. This allowed their technician to flash my X557 to firmware level SDV23B. Happily for me, this immediately resolved the issue. Completely gone. Not a single incident of my 10GbE network ports going down again in the last 16 days of careful 24x7 monitoring, with that happy news shared back to Supermicro too, of course.

    The open question now is what to do about handling the other customers who are still on SDV23A, along with the one report of a customer on SDV23B but still having outages. For folks eager to avoid shipping charges and doing without their system for a while, who are also unwilling to sign the NDA, perhaps something like this procedure could be workable:

    1. contact Supermicro 24-Hour SuperServer Technical Support
    2. ask for the SDV23B fix for your Supermicro 12 or 16 core Xeon D system or motherboard, tell them your serial #
    3. come up with a mutually agreeable date and time for the upgrade
    4. prepare for the upgrade by temporarily removing/detaching all data drives
    5. put the IPMI IP address into the router's DMZ
    6. change the admin account's password to something long and complex
    7. inform the Supermicro technician that your system is ready for the SDV23B upgrade at https://yourpublicipaddress (obtained from something like asking Google "what is my ip") and password longcomplexpassword
    8. once informed the upgrade is complete, continue with the following clean-up steps
    9. take your IPMI interface out of the DMZ
    10. unplug power from the SuperServer for the changes to take effect
    11. insert/reattach the drives
    12. power up, watch the OS finish booting, see the green 10GbE LEDs illuminate

    I'm having little luck with my multiple attempts to reach out to Intel and Supermicro in the past 2 weeks, but I'm continuing to work closely with Wiredzone, who continue to be very helpful each and every step of the way. I will continue to inform my readers of progress right here, in this same article.


    Mar 16 2018 Update

    Outages are much more rare with SDV23B

    Unfortunately, after 31 days of no recurrence of this issue in my home lab, it happened again. I have reached out to Supermicro for next steps, but have not heard back from them yet. This is disappointing, and confirms what Devoid at TinkerTry here had previously reported.

    Xeon D-1540

    I now have another report of a user experiencing at least 1 10G network outage per day on the original Xeon D-1500 that existed at launch: the Xeon D-1540. It's an 8 core, and it too has the network outage issue.


    Mar 19 2018 Update

    I'm looking into the firmware that's available straight from Intel here:

    The X552/X557 shares the same drivers as the popular Intel X540 PCIe NIC, which I've used in my home lab on a Sandy Bridge Core i7 system with no outages for many months.

    Here’s the relevant section of the output of

    esxcli network nic get -n vmnic3

    for each Xeon D system listed below, all running BIOS 1.2c.

    • Xeon D-1540 - had daily outages, now weekly
      Supermicro SuperServer SYS-5018-FN4T with NIC firmware 22.9 (2017-11-03) from Intel
      Driver Info:
      Bus Info: 0000:03:00.1
      Driver: ixgbe
      Firmware Version: 0x800001cf, 255.65535.255
      Version: 4.5.3-iov

    • Xeon D-1541 - had zero outages, ever
      factory default, on my Bundle 2 Supermicro SuperServer SYS-5028D-TN4T

      Driver Info:
      Bus Info: 0000:03:00.1
      Driver: ixgbe
      Firmware Version: 0x800003e7
      Version: 4.5.3-iov

    • Xeon D-1567 - had daily outages, now monthly
      my system updated to SDV23B, a Bundle 1 Supermicro SuperServer SYS-5028D-TN4T 12 core
      Driver Info:
      Bus Info: 0000:03:00.1
      Driver: ixgbe
      Firmware Version: 0x800006b7
      Version: 4.5.3-iov

    May 06 2018 Update

    Supermicro-Xeon-D-1500-compared--TinkerTry

    I've been told today that all new Xeon D-1567 systems that Supermicro ships from San Jose CA to Wiredzone for resale already have the Intel X557 firmware fixes. I can also add that a newer Xeon D-1567 system that I had on loan back in January had zero incidents of 10GbE outages. The test period was 3 weeks of heavy testing as my temporary primary workstation, with careful syslog monitoring. That story should provide you with an additional level of reassurance that the issue is unlikely to be encountered by Wiredzone customers with Xeon D-1567 SuperServer Bundles delivered any time this year.

    Also worth noting that I still have no reports of X557 issues on Xeon D-1518 based SYS-E300-8D and Xeon D-1528 based SYS-E200-8D SuperServers of any vintage.

    Finally, I have two new 4 core Xeon D-1521 stories to share. These are great examples of folks helping each other out, and seem to help confirm that even with BIOS 1.3, folks are still having some X557 networking issues. Apparently it's not just Xeon D owners with 12 or more cores, and it's not just folks using 10GbE switches either. This is a 1GbE networking story, on a SuperServer that only has two X557 ports, and no 1GbE ports.

    Of course, I've shared these new stories with Supermicro as well, as it's a significant new spin on a 13 month old story, with a suggested work-around for VMware users using 1GbE switches, highlighted below.

    The first story comes from Bruno Zeidan, in his comment at TinkerTry here (excerpts):

    Nice post. Although, I have an issue with ESXi 6.7 on SuperMicro X10SDV-4C-TLN2F. NICs are recognized, but they are not getting link status updates or link is not going up. This is Xeon D-1521 which only has 2x 10GE network interfaces. Therefore, major issue as I don't have other means of network connectivity.
    I've been waiting for ESXi native support for this card, but in fact, after implementing 6.7, now the network link status are not detected. (keep saying Disconnected).
    I'm using Gigabit links although these are 10GE interfaces (it works and is supported).
    Did you test the 10GE interfaces? Do they work in your case?

    ...

    I was on BIOS 1.3 already. Actually, I managed to make it work, on ESX CLI, I did:
    esxcli network nic down -n vmnic0
    esxcli network nic up -n vmnic0

    But, it didn't survive to reboot. Every reboot, I had to do the same. Also installed latest driver ixgbe 4.5.3. But still not working.

    To fix the issue, solution was to set the speed manually for both NICs. Finally fixed.
    esxcli network nic set --speed 1000 --duplex full -n vmnic0
    esxcli network nic set --speed 1000 --duplex full -n vmnic1
    ...

    The second story arrives from Alessandro Segala, and his Xeon D-1521 based 4 core Supermicro X10SDV-4C-TLN2F motherboard, kicked off with this comment (excerpts):

    ...
    My motherboard only came with two 10gbe X557, and I don’t have any gigabit Ethernet. This has made things a bit challenging, but I’m able to run commands via the console using IPMI. I typed the model wrong, as it’s actually a X10SDV-4C-TLN2F (Xeon D-1521).

    I had tried reinstalling the VIB from Intel, but it still doesn’t work. ESXi recognizes the network adapters, but they’re reported as “down” all of the time. The cable is connected and the light is on.

    I have tried simply re-installing 6.7, following your guide for a fresh install, wondering if my specific installation was corrupted... that again did not fix the issue.

    I’m starting to wonder if there’s a bug preventing X557 network cards to work with 6.7 as management NICs? Although they are certified by VMWare to work, and the VIB lists 6.7 as supported.

    ...

    Paul, thanks for sharing the last link, it seems exactly the same issue I had. Sorry for not noticing it earlier.

    I have the same hardware Brian has, and, like him, I'm using gigabit network switches. Changing the configuration so speed is set to 1000 seems to have fixed the issue (and it survived a reboot). I had to do it using the ESXi Shell via IPMI (Alt + F1), which is quite awkward but worked. Sounds like it's a real bug.

    Now, I just need to re-configure ESXi based on my docs, since I had to do a fresh-install (ouch).

    PS: I did take the power off (and kept it off for a bit) twice, and it didn't work. I am already on the last BIOS (1.3) too.
    ...

    Based on the changed nature of this issue, and some discouraging news I've just received about self-service fix options, I've had to update the title from:

    • Temporary workaround to recover from intermittent Intel X552/X557 10GbE network outages on 12 and 16 core Xeon D, hoping for a public firmware update

    to:

    • Workaround and fix for intermittent Intel X552/X557 10GbE/1GbE network outages on Xeon D-1521/1540 and pre-2018 1567/1587; popular E300-8D/E200-8D 1518/1528 and 1541 never had the issue

    When I first wrote this article, I could never have known that this would turn out to be the right title. The benefits of hindsight and valued TinkerTry feedback have been considerable. I'm very thankful!

    For folks not interested in signing an NDA to do the firmware update themselves, I'm still communicating with Wiredzone and Supermicro to get clarification on the potential for an RMA process for Wiredzone customers, where the firmware gets flashed for you. I'll append further information to this article once it's received.

    As for me, with my October-2016-vintage Xeon D-1567 that's in my world's first TinkerTry'd Bundle 1 system, Supermicro performed the Intel X557 firmware patch a few months back. My recurrence of this issue (10GbE goes down and stays down) has been reduced to roughly once every other month. VMware's vRealize Log Insight has shown some brief (1 to 10 second) down/up event pairings, usually of both 10GbE ports, a few times a day. I've never actually noticed these events while using the system heavily, as this is my primary workstation. I suspect they might be spurious false positives, and that if I aimed a camera at the network port LEDs 24x7, I'd find they never actually went off. But I don't really know for sure.


    May 29 2018 Update

    I received the following information about how folks affected by X557 10GbE network drops can get their system repaired via firmware flash done at Supermicro:

    ...the people who are responsible the barebone PM and motherboard PM and they won't provide any other way to apply the fix, unfortunately. This is because of their agreement with Intel, they are not willing to bend that. So any customer interested in having the fix for the network drop issue will have to ship the unit back to them, shipping both ways is still covered by Supermicro.

    This is a disappointing "fix." Given I don't know what they do, and don't have release notes to know if this is any different than what was done to my system, I likely won't be sending in my unit, since I no longer experience outages often enough (less than once a month) for it to be a significant problem for me.

    Please contact Supermicro Support directly to make arrangements. I don't have details about how these are handled outside of North America, but I would love to hear, via comments left below, how folks who go through this return process are treated.


    Jun 12 2018 Update

    Unfortunately, the X557 link-loss problem has returned to my Xeon D-1567 system, happening at a rate of roughly once per day now. I have no idea if applying IPMI 3.68 somehow made this happen a little more often. That's a very long shot. I will be shipping my system in to Supermicro when I have an opportunity to do so.

    One TinkerTry visitor has sent in a new report of his experience getting the fix, SDV23B. It worked for him, and that is good!

    Unfortunately, he also later received this email:

    I am sorry that our company doesn’t accept end user/personal NDA. You will have to bring the unit back for RMA. Please submit the ticket for RMA online.

    https://www.supermicro.com/support/rma/

    That would seem to indicate that individuals who don't list a company name on the NDA forms may not have a self-service way to fix this issue remotely.


    Jul 25 2018 Update

    I decided to take Wiredzone and Supermicro up on their offer to flash my X557 firmware at Supermicro. I was provided a UPS ground label, and was without my system from 6/15/2018 to 7/10/2018, that's 25 days in all. Good thing I had a secondary 8 core SYS-5028D-TN4T that I could use in its place while it was away for so long. It's not yet clear what they did to my system, as far as the exact firmware, but I can say that it was still the original motherboard that was returned to me.

    I'm now also testing out newer ESXi drivers named ixben 1.7.1 for ESXi 6.7, and 1.6.5 for ESXi 6.5, more details here.


    See also at TinkerTry

    supermicro-superservers-vcg-updated-to-6-7

    how-to-install-esxi-on-xeon-d-1500-supermicro-superserver

    supermicro-superserver-bios-13-and-ipmi-358-released

    supermicro-superservers-vcg-updated-to-65u1

    vrealize-log-insight-install-configure-syslog-update

    promise-sanlink3-t1-adapter-gives-thunderbolt-3-usb-pc-10gbe

    xeon-d-landscape-2017

    ubiquiti-mpower-pro-8-port-outlet-measures-watts

    a-good-look-at-the-worlds-first-16-core-supermicro-superserver-xeon-d-1587-thanks-canada

    how-to-install-intel-x552-vib-on-esxi-6-on-superserver-5028d-tn4t

    See also

    The-Lone-Sysadmin