Workaround and fix for intermittent Intel X552/X557 10GbE/1GbE network link-down outages on Xeon D-1500 Series

Posted by Paul Braren on Dec 26 2017 (updated on Feb 19 2022) in
  • Network
  • HowTo

    If you really want to skip the interesting backstory, jump right down to the symptom, the workaround, the proposed fix, and the actual fix done at Supermicro. This issue never affected the popular Xeon D-1541 systems with 8 cores.

    Backstory

    Yeah, it's a bit long, and somewhat complicated. At least I know it will feel good when this is all completely behind us!

    I'm glad that there are now hundreds of happy Supermicro SuperServer Bundle owners in the world who by-and-large have greatly enjoyed their product ownership experience, sharing a lot of positive experiences publicly, with fewer than 0.5% of Bundle buyers returning their SuperServer to Wiredzone for any reason. I'm also glad to have had TinkerTry readers play an active role in bringing the even-more-capable 12 core Xeon D-1567 to market in the form of another Wiredzone Bundle. It ships already burn-in-tested and fully warranted, along with the latest tested BIOS and IPMI firmware already installed. This has made for a much-improved out-of-box experience for those eager to get right to work, and once the fix is available for this 10GbE issue, you can bet Wiredzone will quickly add that fix to their standard procedures.

    When it comes to 10GbE networking, it turns out the 12 core Xeon D's track record has been a little bumpy. It's quite possible that only a small proportion of those owners actually use their two 10GbE ports for 10G or 1G connectivity. It is with those folks in mind that I write this article, those most enthusiastic Xeon D fans that paid the premium for extra cores and a bigger heatsink/fan, expecting to enjoy nearly linear scaling.

    [Image: TinkerTry-labeled 10GbE ports on the back of a Xeon D SuperServer Bundle]

    Some of those 12 core owners have unfortunately been experiencing an intermittent problem with network outages, where the physical link-layer LEDs go dark at some random time, for no apparent reason. This unfortunate link-down state currently has only one known recovery method: shut down whatever OS you're running, then remove power from the system. That's an unacceptable "fix," more like a workaround really. This sort of power cycling can't be done remotely, unless you happen to have a smart power strip already installed between your UPS and your SuperServer.

    First report - April 2017

    Back in April of 2017, a solitary report of some 10GbE strangeness arrived, documented so well by Devoid at TinkerTry here:

    Months back, I experienced the random disconnects of the 10G NICs. I did the BIOS and ICMP updates along with ESXi 6 U2 and the 4.4.1 x552 driver. Things had been running without issue for months, so I thought all was well. I recently decided to turn up some of my old VMs that I've had powered off. All was well for about 3 days. Now, it appears my 10G disconnecting NIC issue is back. I even just updated to 4.5.1 x552 driver...no luck. The kicker, no matter how many reboots, how many interface shutdowns (on both the switch and esxcli), and even pulling the network cable, nothing would bring up the links. I even tried hard setting speeds, nothing. The only solution, pull power to the X10SDV-12C-TLN4F. The huge problem this is causing me is that all my VM storage is on my synology NAS, via 10g (x540-T2 also hooked up to same Cisco switch, no issues). Once my supermicro 10g Links go down, all the VMs die.

    So. My questions: is anyone else experiencing this? Given that it ran fine for months with a few VMs, but came crashing down when I loaded it up, I'm wondering if it's linked to load?

    My setup:
    X10SDV-12C-TLN4F
    BIOS: 1.1c
    Firmware: 03.46
    ESXi 6.0.0, 4600944 (u2)
    Cisco 3750x, C3KX-NM-10GT
    Synology RS3413xs+
    HELP! I wouldn't even know who to talk to, Cisco, Supermicro, VMWare, Intel?

    I replied with good intentions, then David replied back again with good news:

    Update to the story. I haven't been able to open a VMWare ticket yet. Hopefully I will be able to through work, if needed.

    The good news, however, is that I did open a ticket with SuperMicro, and they responded pretty quickly, and suggested I update the firmware. They provided a specific firmware update for the 10G NICs. I'd pass is along, but it seems pretty specific. The firmware was labeled: SDV23A.

    So far, so good. 6+ days and counting. It has a decent load, so I plan to pile on a few more VMs, and keep monitoring.

    Second report - July 2017

    [Image: TinkerTry Xeon D SuperServer Cluster featuring Micron NVMe and SSD, Supermicro VMware VSAN demo for VMworld 2016]
    My Netgear XS708T was in my basement, but my server was on my second floor. 100' of CAT7 and some attic adventures solved that problem.

    A few months later, another report. I'll admit I didn't think too much of this second report, since an RMA swap resolved his issue. Without a 10GbE switch hooked up to my own Xeon D-1567 SuperServer Workstation Bundle 1, I didn't have a way to replicate the rarely encountered issue either. But I never forgot about it, and this incident motivated me to wire my home up for 10GbE. So I climbed into my sweltering hot attic in August to get myself some fresh 100' CAT7 cabling strung from my basement's Netgear XS708T ProSAFE 8-Port 10-Gigabit Smart Managed Switch to my 2nd floor via the attic. Why? Well, I use my SuperServer Workstation near sleeping humans. While my Netgear XS708T was quieter than the Ubiquiti ES-16-XG switch I briefly tried (unboxing and testing), this switch was still far noisier than any of my Xeon D servers. That's why the switch stays in my basement, near my Xeon D-1541. At the time, I also hadn't heard about the Netgear XS708T's simple fan swap solution.

    The symptom

    Both of your 12 core Xeon D server's Intel X557 10GbE RJ45 port LEDs go dark at some seemingly random interval, indicating a loss of link at the physical layer. All LEDs go dark: the yellow link LED and the green network speed LED. The frequency of these outages ranges from several times per day to once every few months. It can happen with whatever OS you're running, and at seemingly random times, regardless of workload.

    Who might be affected

    Anybody with a 12 core Xeon D system using the X557 10GbE ports
    It's more complicated than that. To date, this issue seems to happen on:

    • Any OS
      I have reports of this problem occurring on:
      • VMware ESXi 6.0
      • VMware ESXi 6.5
      • XenServer 7.3
        That's not all OSs, but I don't have any reason to believe the others are immune. I likely just haven't received any reports of it happening on Windows yet.
    • Any 10G switch
      I have reports of this problem occurring on:
    • Any 12 core or 16 core Xeon D
      Presumably any brand of 12-or-more core Xeon D system (there are many!), but I only first heard of this issue on Supermicro 12 core systems. This includes:
      • Xeon D-1557 featured on the X10SDV-12C-TLN4F motherboard as reported by Devoid here and discussed by phone recently
      • Xeon D-1567 featured on the X10SDV-12C+-WD002 motherboard (PIO-5028D-TN4T-01-WD002 in Windows Device Manager) that Wiredzone sells as part of the SYS-5028D-TN4T-12C SuperServer Bundle 1 and Bundle 2 system.
      • Xeon D-1577, Xeon D-1571, Xeon D-1559 likely affected too, see also the entire Intel Xeon Processor D Family (aka Broadwell DE) on Ark here.
    • Any network connection speed, 10GbE, and maybe 1GbE too
      • Presumably 1GbE links to the X557 network ports are also prone to this failure, but that's conjecture. Indications so far are that this appears to be a firmware issue with the X557 itself.

    Failed workaround attempts

    I realize this is an odd section title, but when you read the bullet list, you'll start to gain a further understanding of why it has been challenging to get to the bottom of this issue.

    1. Power cycling the 10GbE network switch
    2. Upgrading firmware of the 10GbE switch
      I only tried this with my Netgear XS708T, it made no difference.
      I'm currently at 6.6.1.7, 1.0.0.8 level with 1.3.6.1.4.1.4526.100.4.39 System Object OID.
    3. Forcing different network negotiation methods in the device driver
      I only tried the tweaks that the Intel driver VIBs would allow, under VMware ESXi 6.5U1.
    4. Trying different CAT6a or CAT7 cables
    5. Trying different cable lengths
    6. Correlating OS events with network outage events, no obvious pattern after exploring syslog from Netgear and attached VMware ESXi 6.5U1 host, with the configuration of VMware vRealize Log Insight detailed at TinkerTry here.
    7. Applying the Intel X557 firmware SDV23A using Intel's SDVTLN4.BAT batch file on DOS bootable media fixed the issue for Devoid for a few months, but it didn't work for me. I encountered another outage less than a day after the firmware upgrade, following a few weeks of uptime. I'm really not sure what this means, there's just not enough data yet. It's also possible my firmware update didn't complete successfully.

    Why am I unsure about the upgrade? It starts with my article:

    • How to check network driver and NIC firmware details in VMware ESXi
      which I used to find the following information right after the SDV23A upgrade on my Xeon D-1567:

    • Xeon D-1567 (TinkerTry home lab 12/26/2017):
      Driver Info:
           Bus Info: 0000:03:00.0
           Driver: ixgbe
           Firmware Version: 0x800005ad
           Version: 4.5.3-iov
    • Xeon D-1541 (TinkerTry home lab 12/26/2017):
      Driver Info:
           Bus Info: 0000:03:00.0
           Driver: ixgbe
           Firmware Version: 0x800003e7
           Version: 4.5.3-iov

    Yes, the firmware versions seem to differ. But do I know that SDV23A is supposed to give me 0x800005ad? Not entirely sure, and it doesn't show anywhere in my archive of all BIOS release notes.
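
To make that kind of side-by-side comparison less error-prone, here is a minimal Python sketch that pulls the Driver Info fields out of saved `esxcli network nic get` output and checks whether two hosts report the same firmware word. The function names and the trailing `Link Detected` line in the samples are illustrative assumptions, and the sketch only tells you whether the values differ; it does not know which value SDV23A is supposed to produce.

```python
def parse_driver_info(text):
    """Return the fields nested under 'Driver Info:' as a dict."""
    info = {}
    header_indent = None
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        if header_indent is None:
            # Still looking for the section header.
            if line.strip() == "Driver Info:":
                header_indent = indent
            continue
        if indent <= header_indent:
            break  # indentation dropped back, so the nested block is over
        key, _, value = line.strip().partition(":")
        info[key.strip()] = value.strip()
    return info


def firmware_word(info):
    """Convert the leading hex token of 'Firmware Version' to an int."""
    return int(info["Firmware Version"].split(",")[0].strip(), 16)


# Samples built from the values shown above; 'Link Detected' is an
# illustrative trailing field used only to show where the section ends.
SAMPLE_D1567 = """\
   Driver Info:
         Bus Info: 0000:03:00.0
         Driver: ixgbe
         Firmware Version: 0x800005ad
         Version: 4.5.3-iov
   Link Detected: true
"""

SAMPLE_D1541 = SAMPLE_D1567.replace("0x800005ad", "0x800003e7")

d1567 = parse_driver_info(SAMPLE_D1567)
d1541 = parse_driver_info(SAMPLE_D1541)
print("Same driver:", d1567["Version"] == d1541["Version"])
print("Same firmware:", firmware_word(d1567) == firmware_word(d1541))
```

Saving the output from each host to a text file and feeding it through the same function keeps the comparison honest when you revisit it after a later flash.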

    The workaround

    1. gracefully shut down your 12 core (or greater) system
    2. unplug the power cord for at least 15 seconds
    3. plug the power cord back in
    4. power up and boot your operating system up

    The workaround for VMware ESXi

    [Image: vSwitch1 Edit Settings, TinkerTry]

    This workaround won't prevent you from losing 10GbE connections, but it will allow an automatic failover to 1GbE for those occasions where powering down is very inconvenient.

    While you're likely using Intel I350 ETH0 for your service console, you can assign ETH1 to be your standby adapter.
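
For those who prefer the command line over the vSphere GUI, the same active/standby arrangement can be expressed with `esxcli network vswitch standard policy failover set`. The Python sketch below only assembles that command so you can review it before running it in an ESXi shell. The vSwitch name and the vmnic numbers are assumptions for illustration; check your own uplink names with `esxcli network nic list` first.

```python
import shlex


def failover_command(vswitch, active, standby):
    """Build the esxcli argv that sets active and standby uplinks on a standard vSwitch."""
    return [
        "esxcli", "network", "vswitch", "standard", "policy", "failover", "set",
        "--vswitch-name", vswitch,
        "--active-uplinks", ",".join(active),
        "--standby-uplinks", ",".join(standby),
    ]


# Hypothetical names: vmnic2 stands in for an X557 10GbE port and
# vmnic1 for the I350 1GbE port used as standby.
cmd = failover_command("vSwitch1", active=["vmnic2"], standby=["vmnic1"])
print(shlex.join(cmd))
```

Keeping the 1GbE port strictly in the standby list, not the active list, means traffic only moves to it when the 10GbE link actually drops.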

    The fix

    1. Contact Supermicro's 24-Hour SuperServer Technical Support directly.
    2. Inform the technician that you're opening a service request for your 12 core Xeon D system because of Intel X557 10GbE networking issues, asking that they provide you with the firmware fix.
    3. Supermicro might insist you sign an NDA before they can share the fix with you, I've been told.
    4. If you get "push back," ask the technician to refer to Supermicro Service Record # SM1704244248 that was reported to TinkerTry readers here.
    5. Supermicro then sends you an Intel utility to flash the two X557 ports on the motherboard. I'm not sure what firmware level that is.
    6. For customers unwilling to sign an NDA, Supermicro has offered to have them ship their system in, and Supermicro will flash the X557 firmware for them. I don't have any confirmed stories of this actually happening though, with most folks electing to just sign the NDA.

    The better (future) fix

    A BIOS upgrade that also flashes both the Intel X557 (10G) and Intel I350 (1G) NICs, should it be confirmed that deploying those flash updates (without NDA) resolves these known issues.

    How failing fast might help everybody

    When troubleshooting such intermittent problems, it becomes important to find a way to make the problematic system fail fast. In other words, come up with an easy way to cause the problem without having to wait for it to happen naturally. Ideally, that means discovering a way for anybody to replicate both the entire system configuration and the problem (network outage), on-demand, at-will. This would give Supermicro a way to more easily recreate the issue themselves, the first step in getting a proper solution that anybody can apply to their own system. This solution would most likely be in the form of a new BIOS version. Meanwhile, 1.2c is the latest BIOS currently available, see:

    As a blogger representing Supermicro owners like myself, I'm very reluctant to sign an NDA, much preferring to focus my energies on helping Supermicro find a solution that helps everybody anyway. For that to happen, these steps are likely needed first:

    1. A full recreate performed at Supermicro
    2. A fix is developed, presumably firmware
    3. Supermicro may need to coordinate with Intel for this
    4. QA testing of the fix done at Supermicro prior to GA release

    This all adds up to time. It will likely be weeks or even months before we have this fixed, and I'm very sorry about this temporary inconvenience. This article should help make that wait a little easier.

    Re-creation

    1. Install BIOS 1.2c and IPMI 3.58.
    2. Configure the BIOS exactly as shown here.
    3. Download free hypervisor ESXi 6.5.0a.
    4. Use iKVM to mount the ISO and install ESXi onto bootable USB media, such as on the readily available Sandisk.
    5. Once ESXi is configured, allow ssh (detailed in this article) then issue these two lines to download and install the latest Dec 04 2017 build 7388607 (helpful version history found here):
      esxcli software profile install -p ESXi-6.5.0-20171204001-standard -d https://hostupdate.vmware.com/software/VUM/PRODUCTION/main/vmw-depot-index.xml
      reboot
    6. As usual with all 6.x builds of ESXi, you’ll notice no 10G drivers are working or even visible: the built-in VMware inbox drivers don’t work with X557, so you need to install the 4.5.3 VIB from here then
      reboot
    7. You may find that you now have no 10G connection at the physical level (no link LEDs on, on your 10G switch).
    8. This 10G network out problem can be resolved temporarily by shutting down and unplugging all system power for >15 seconds, then powering back up and booting ESXi back up, waiting for it to finish booting so the 10G driver loads and 10G speed indicator comes back on.

    Reference

    The network adapter part #s are seen on page 3 of Intel's document:

    You'll find the various products that share the same device driver:

    Product Codes: EZX557-AT, EZX557-AT2, and EZX557-AT4

    which are also listed by the Device ID that's found at various places in the vSphere GUIs:

    Table 1-2 Device ID

    X557 Device                                             Vendor ID   Device ID
    Intel® Ethernet Connection X557-AT (Single 19x19mm)     8086        0xB4A3
    Intel® Ethernet Connection X557-AT2 (Dual 19x19mm)      8086        0xB4C3
    Intel® Ethernet Connection X557-AT4 (Quad 25x25mm)      8086        0xB4B3
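
If you are collecting IDs from several hosts, a small lookup keeps the matching consistent. This is a minimal Python sketch that uses only the three rows from the Device ID table above. The `vendor:device` string format and the function name are assumptions for illustration, so adapt the parsing to however your tool prints the IDs.

```python
INTEL_VENDOR_ID = 0x8086

# Device IDs copied from the table above.
X557_DEVICE_IDS = {
    0xB4A3: "Intel Ethernet Connection X557-AT (Single 19x19mm)",
    0xB4C3: "Intel Ethernet Connection X557-AT2 (Dual 19x19mm)",
    0xB4B3: "Intel Ethernet Connection X557-AT4 (Quad 25x25mm)",
}


def identify_x557(pci_id):
    """Map a 'vendor:device' hex string such as '8086:b4c3' to a product name.

    Returns None when the vendor is not Intel or the device ID is not in the table.
    """
    vendor_text, _, device_text = pci_id.partition(":")
    vendor = int(vendor_text, 16)
    device = int(device_text, 16)
    if vendor != INTEL_VENDOR_ID:
        return None
    return X557_DEVICE_IDS.get(device)


print(identify_x557("8086:b4c3"))
```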

    Closing thoughts for tonight

    I'm working with Supermicro support directly to help get a fix to you, my valued TinkerTry reader who invested heavily in a > 8 core Xeon D and demands stable 10GbE networking. It's now 10:22pm, and we just finished a long phone call together working through all the details of this article.

    The 12 core Xeon D folks who are using the latest X557 firmware that Supermicro provides are reporting that their 10G network drop issues have gone away, and I've spoken to one such individual myself.


    Dec 27 2017 Update

    Typos and grammar cleaned up. More conversations with Supermicro planned soon, with future progress updates to be posted right here.

    FYI, there may also be a small issue for some 12 core Xeon D owners using their Intel I350 1GbE ports under ESXi, with ETH0 and ETH1 assignments occasionally swapped after ESXi 6.0 to 6.5 upgrades. This seems to be easily remedied by installing the supported driver for ESXi, as explained and shown here. It is odd how these 2 network issues only affect 12 core (and maybe 16 core) Xeon D owners. One really has to wonder how this could be, given how similar they are when compared with the 8 core Xeon D-1541.

    Why is an NDA needed for firmware?

    I suspect Intel is requiring Supermicro to not widely distribute the firmware fix, and/or the DOS tool EEUPDATE that implements the fix. Maybe it's just beta, maybe there are legal restrictions. I don't know for sure, this is just conjecture. It's important to note that for customers unwilling to sign an NDA, Supermicro has offered to have them ship their system in, and Supermicro will flash the X557 firmware for them. I don't have any confirmed stories of this actually happening though, with most folks electing to just sign the NDA.

    How does Xeon D-1541 differ from the Xeon D-1567?

    What I'm focused on here is things that differ that could affect the way the BIOS and IPMI are configured. The idea is to see if there's some good reason that the X557 would apparently begin to run into trouble only on systems with more cores.

    • Taller Heatsink - This assembly also includes a slightly bigger CPU fan.
      • This would seem to only cause Supermicro to slightly tweak the factory default fan speeds for the CPU fan header on the SoC/motherboard, which is very unlikely to change the temperatures of the physical X557 10G interfaces.
    • Slightly increased watt burn - Up to roughly 20% extra watts used versus 8 core models with the same clock speeds, and only when handling very heavy workloads.
      • Stress would seem to cause the system to come slightly closer to using about a third of the 250 watt power supply that comes with the CSE-721TQ-250B chassis, one of the pieces that make up the SuperServer SYS-5028D-TN4T bare-bones system.
      • It's hard to see how this would matter, especially since the X557 PHY should handle high ambient temps just fine, see the many ruggedized fanless designs. These network outages seem to happen during periods of inactivity/idle just as often as when the system is under load.
    • Slightly lower 2133MHz for up to 128GB of ECC DDR4 - Intel's architectural restrictions mean that only the Xeon D 8 core design allows the system to negotiate 2400MHz DDR4 speeds at POST, as confirmed in the BIOS and explained here and here.
      • The 12 core, and all other Xeon D models (4, 6, and 16 core) negotiate 2133MHz speeds, which is normal. You're unlikely to notice this speed difference during normal use; it's perhaps up to a 5% difference that's likely only revealed with synthetic benchmarks.
      • An increased number of cores is a huge advantage for multi-threaded workloads. For pricing/marketplace reasons, Wiredzone cut over to 2400MHz for all SuperServer Bundles a long time ago, in the summer of 2016. I have no regrets in buying my 12 core system. I actively use mine hundreds of hours per month creating nearly all the content and videos here at TinkerTry.
      • It doesn't matter if you have 2 (included with Bundles) or 4 memory sticks installed, 2133MHz is the max you'll get if you have anything other than 8 cores in your Xeon D.
      • This Intel design restriction seems to be confirmed to be an industry-wide thing, seen on the various specs sheets here.

    I'm an optimist about this sort of thing, even issues that drag on as long as this one has. With so many companies making Intel Xeon D systems out there, and many designs enjoying at least 7 years of product life and support, Intel and Supermicro are highly motivated to resolve this issue. Intel has historically had very robust firmware, and their track record of many VMware/Intel X557 driver/VIB releases these past 2+ years demonstrates how active the Xeon D market continues to be.


    Jan 16 2018 Update

    TinkerTry is All-in!

    After weeks of failed X557 issue recreate attempts at Supermicro HQ in San Jose CA, I'm shipping my very own Xeon D-1567 SYS-5028D-TN4T to them. It just so happens to likely be the very first such unit ever produced, but that's likely not relevant, as it appears 200 identical 12 core Xeon D-1567 motherboards were made in the same production run for Wiredzone. Since my system encounters the X557 issue within a day or two regardless of the workload, recreate at Supermicro shouldn't take long, and I'm even mailing them my Netgear XS708T switch too, just in case.

    How long it takes Supermicro and Intel to develop a proper fix is another matter, but I'm doing all I can to accelerate that process. This isn't a simple matter.

    Since my 12 core is my primary workstation & datacenter, to make this loan to Supermicro possible, TinkerTry.com, LLC has now invested in a 2nd identical Supermicro SuperServer Bundle 2 12 core. Once I have my drives and primary Windows 10 VM moved over, and a recreate accomplished on the exact configuration I intend to ship, I'll be able to send the affected system off to California. Special thanks to Wiredzone for helping with accelerated cross-country shipping, and to all my advertisers for making TinkerTry's 3rd Xeon D purchase possible. I stand behind anything I put my name, and reputation, behind.

    This new addition to my family will be a huge boon for my testing and staging and content creation, with one Xeon D-1541 and one Xeon D-1567 soon available to test and reboot, at will. Well, at least once my primary workstation returns, of course. Hopefully soon.

    I've also collected several Supermicro SM#s for them to investigate, from various TinkerTry readers who have been pitching in.

    Xeon D-1540 (8 core) and Xeon D-1587 (16 core) too?

    Note that the first-generation Xeon D was called the Xeon D-1540, and there is now one report of the same X557 network-down issue by verdragan in this article's comments. That is an interesting twist, perhaps this will give Supermicro and/or Intel some insight into root cause.

    You'll also notice we now have a report from Xeon D-1587 owner takaze, with the article above updated accordingly.

    Finally, there's an owner of 6 SuperServer Bundles out there: four 8 core systems and two 12 core systems. He's only experienced these X557 issues on the 12 core systems. It breaks my heart to ask him to avoid 10G for now as a workaround, but I'm confident we'll get this resolved, without him having to sign an NDA or ship his systems to Supermicro to have them updated.


    Jan 19 2018 Update

    Supermicro has been in communications with me about this, and I might not need to ship them my system after all, with many folks now involved. I don't have any significant new developments to share at this time, unfortunately.

    I've also reached out to an Intel spokesperson for assistance.


    Feb 02 2018 Update

    Preparing my system for shipment to Supermicro for recreate has proven to be much more challenging than anticipated. The problem is no longer happening on a daily basis, due to factors I haven't yet figured out. I will keep posting details on my tests right here.


    Mar 03 2018 Update

    During 3 weeks of heavy 10GbE testing on another Xeon D-1567 SuperServer (Bundle 2), the issue did not surface. Admittedly, I don't really know why, but using syslogging to my vRealize Log Insight, I'm able to confirm that I've had zero outage incidents on that system, even though it was literally running the same OS, which is still ESXi 6.5U1 on USB, moved over to the temporary, new system.

    In a way this is good, and it could explain why Wiredzone has had less than a handful of reports of this problem, with most of those coming from comments on this article. I still don't know why this issue happens on only some 12 core Xeon D systems, and not others.

    I managed to convince Supermicro to perform my X557 firmware flash remotely, as a pilot effort of sorts. This saved on shipping costs, and side-stepped the need to sign any NDAs.

    First, I carefully and temporarily exposed the IPMI interface of my problematic Xeon D-1567 to Supermicro support on a new, public IP address. They were then able to access my system's IPMI (only) over https, using a non-default, very long password. This allowed their technician to flash my X557 to firmware level SDV23B. Gladly, for me, this immediately resolved the issue. Completely gone. Not a single incident of my 10GbE network ports going down again in the last 16 days of careful 24x7 monitoring, with that happy news shared back to Supermicro too, of course.

    The open question now is what to do about handling the other customers who are still on SDV23A, along with the one report of a customer on SDV23B but still having outages. For folks eager to avoid shipping charges and doing without their system for a while, who are also unwilling to sign the NDA, perhaps something like this procedure could be workable:

    1. contact Supermicro 24-Hour SuperServer Technical Support
    2. ask for the SDV23B fix for your Supermicro 12 or 16 core Xeon D system or motherboard, tell them your serial #
    3. come up with a mutually agreeable date and time for the upgrade
    4. prepare for the upgrade by temporarily removing/detaching all data drives
    5. put the IPMI IP address into the router's DMZ
    6. change the admin account's password to something long and complex
    7. inform the Supermicro technician that your system is ready for the SDV23B upgrade at https://yourpublicipaddress (obtained from something like asking Google "what is my ip") and password longcomplexpassword
    8. once informed the upgrade is complete, continue with the following clean-up steps
    9. take your IPMI interface out of the DMZ
    10. unplug power from the SuperServer for the changes to take effect
    11. insert/reattach the drives
    12. power up, watch the OS finish booting, see the green 10GbE LEDs illuminate
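
For the "long and complex" password in step 6, one option is to let Python's standard `secrets` module generate it instead of inventing one by hand. This is only a sketch under stated assumptions: the default length of 20 reflects my understanding that IPMI 2.0 caps passwords at 20 bytes, and the alphanumeric-only alphabet is a cautious choice in case your BMC's web form rejects some symbols. Check what your own IPMI firmware accepts, and change the password again once the technician is done.

```python
import secrets
import string


def make_ipmi_password(length=20):
    """Generate a random password with at least one lowercase letter, one uppercase letter, and one digit.

    The default length of 20 is an assumption about the IPMI 2.0 limit;
    adjust it to whatever your BMC actually accepts.
    """
    if length < 3:
        raise ValueError("length must be at least 3 to fit all character classes")
    alphabet = string.ascii_letters + string.digits
    while True:
        password = "".join(secrets.choice(alphabet) for _ in range(length))
        if (any(c.islower() for c in password)
                and any(c.isupper() for c in password)
                and any(c.isdigit() for c in password)):
            return password


print(make_ipmi_password())
```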

    I'm having little luck with my multiple attempts to reach out to Intel and Supermicro in the past 2 weeks, but I'm continuing to work closely with Wiredzone, who continue to be very helpful each and every step of the way. I will continue to inform my readers of progress right here, in this same article.


    Mar 16 2018 Update

    Outages are much more rare with SDV23B

    Unfortunately, after 31 days of no recurrence of this issue in my home lab, it happened again. I have reached out to Supermicro for next steps, but have not heard back from them yet. This is disappointing, and confirms what Devoid at TinkerTry here had previously reported.

    Xeon D-1540

    I now have another report of a user experiencing at least one 10G network outage per day on the original Xeon D-1500 that existed at launch: the Xeon D-1540. It's an 8 core, and it too has the network outage issue.


    Mar 19 2018 Update

    I'm looking into the firmware that's available straight from Intel here:

    The X552/X557 shares the same drivers as the popular Intel X540 PCIe NIC, which I've used in my home lab on a Sandy Bridge Core i7 system with no outages for many months.

    Here’s the relevant section of the output of

    esxcli network nic get -n vmnic3

    for each Xeon D system listed below, all running BIOS 1.2c.

    • Xeon D-1540 - had daily outages, now weekly
      Supermicro SuperServer SYS-5018-FN4T with NIC firmware 22.9 (2017-11-03) from Intel
      Driver Info:
      Bus Info: 0000:03:00.1
      Driver: ixgbe
      Firmware Version: 0x800001cf, 255.65535.255
      Version: 4.5.3-iov

    • Xeon D-1541 - had zero outages, ever
      factory default, on my Bundle 2 Supermicro SuperServer SYS-5028D-TN4T

      Driver Info:
      Bus Info: 0000:03:00.1
      Driver: ixgbe
      Firmware Version: 0x800003e7
      Version: 4.5.3-iov

    • Xeon D-1567 - had daily outages, now monthly
      my system updated to SDV23B, a Bundle 1 Supermicro SuperServer SYS-5028D-TN4T 12 core
      Driver Info:
      Bus Info: 0000:03:00.1
      Driver: ixgbe
      Firmware Version: 0x800006b7
      Version: 4.5.3-iov

    May 06 2018 Update

    [Image: Supermicro Xeon D-1500 compared, TinkerTry]

    I've been told today that all new Xeon D-1567 systems that Supermicro ships from San Jose CA to Wiredzone for resale already have the Intel X557 firmware fixes. I can also add that a newer Xeon D-1567 system that I had on loan back in January had zero incidents of 10GbE outages. The test period was 3 weeks of heavy testing as my temporary primary workstation, with careful syslog monitoring. That story should provide you with an additional level of reassurance that the issue is unlikely to be encountered by Wiredzone customers with Xeon D-1567 SuperServer Bundles delivered any time this year.

    Also worth noting that I still have no reports of X557 issues on Xeon D-1518 based SYS-E300-8D and Xeon D-1528 based SYS-E200-8D SuperServers of any vintage.

    Finally, I have two new 4 core Xeon D-1521 stories to share. These are great examples of folks helping each other out, and seem to help confirm that even with BIOS 1.3, folks are still having some X557 networking issues. It's apparently not just Xeon D owners with 12 or more cores either, and it's not just folks using 10GbE switches either. This is a 1GbE networking story, on a SuperServer that only has two X557 ports, and no 1GbE ports.

    Of course, I've shared these new stories with Supermicro as well, as it's a significant new spin on a 13 month old story, with a suggested work-around for VMware users using 1GbE switches, highlighted below.

    The first story comes from Bruno Zeidan, in his comment at TinkerTry here (excerpts):

    Nice post. Although, I have an issue with ESXi 6.7 on SuperMicro X10SDV-4C-TLN2F. NICs are recognized, but they are not getting link status updates or link is not going up. This is Xeon D-1521 which only has 2x 10GE network interfaces. Therefore, major issue as I don't have other means of network connectivity.
    I've been waiting for ESXi native support for this card, but in fact, after implementing 6.7, now the network link status are not detected. (keep saying Disconnected).
    I'm using Gigabit links although these are 10GE interfaces (it works and is supported).
    Did you test the 10GE interfaces? Do they work in your case?

    ...

    I was on BIOS 1.3 already. Actually, I managed to make it work, on ESX CLI, I did:
    esxcli network nic down -n vmnic0
    esxcli network nic up -n vmnic0

    But, it didn't survive to reboot. Every reboot, I had to do the same. Also installed latest driver ixgbe 4.5.3. But still not working.

    To fix the issue, solution was to set the speed manually for both NICs. Finally fixed.
    esxcli network nic set --speed 1000 --duplex full -n vmnic0
    esxcli network nic set --speed 1000 --duplex full -n vmnic1
    ...

    The second story arrives from Alessandro Segala, and his Xeon D-1521 based 4 core Supermicro X10SDV-4C-TLN2F motherboard, kicked off with this comment (excerpts):

    ...
    My motherboard only came with two 10gbe X557, and I don’t have any gigabit Ethernet. This has made things a bit challenging, but I’m able to run commands via the console using IPMI. I typed the model wrong, as it’s actually a X10SDV-4C-TLN2F (Xeon D-1521).

    I had tried reinstalling the VIB from Intel, but it still doesn’t work. ESXi recognizes the network adapters, but they’re reported as “down” all of the time. The cable is connected and the light is on.

    I have tried simply re-installing 6.7, following your guide for a fresh install, wondering if my specific installation was corrupted... that again did not fix the issue.

    I’m starting to wonder if there’s a bug preventing X557 network cards to work with 6.7 as management NICs? Although they are certified by VMWare to work, and the VIB lists 6.7 as supported.

    ...

    Paul, thanks for sharing the last link, it seems exactly the same issue I had. Sorry for not noticing it earlier.

    I have the same hardware Brian has, and, like him, I'm using gigabit network switches. Changing the configuration so speed is set to 1000 seems to have fixed the issue (and it survived a reboot). I had to do it using the ESXi Shell via IPMI (Alt + F1), which is quite awkward but worked. Sounds like it's a real bug.

    Now, I just need to re-configure ESXi based on my docs, since I had to do a fresh-install (ouch).

    PS: I did take the power off (and kept it off for a bit) twice, and it didn't work. I am already on the last BIOS (1.3) too.
    ...

    Based on the changed nature of this issue, and some discouraging news I've just received about self-service fix options, I've had to update the title from:

    • Temporary workaround to recover from intermittent Intel X552/X557 10GbE network outages on 12 and 16 core Xeon D, hoping for a public firmware update

    to:

    • Workaround and fix for intermittent Intel X552/X557 10GbE/1GbE network outages on Xeon D-1521/1540 and pre-2018 1567/1587; popular E300-8D/E200-8D 1518/1528 and 1541 never had the issue

    When I first wrote this article, I could never have known that the new title would have been the right one even then. The benefits of hindsight and valued TinkerTry feedback have been considerable. I'm very thankful!

    For folks not interested in signing an NDA to do the firmware update themselves, I'm still communicating with Wiredzone and Supermicro to get clarification on the potential for an RMA process for Wiredzone customers, where the firmware gets flashed for you. I'll append this article with further information, once it's received.

    As for me, with my October-2016-vintage Xeon D-1567 that's in my world's first TinkerTry'd Bundle 1 system, Supermicro performed the Intel X557 firmware patch a few months back. My recurrence of this issue (10GbE goes down and stays down) has been reduced to roughly once every other month. VMware's vRealize Log Insight has shown some brief (1 to 10 second) down/up event pairings, usually on both 10GbE ports, a few times a day. I've actually never noticed these events while using the system heavily, as this is my primary workstation. I suspect they might be spurious false positives, and that if I aimed a camera at the network port LEDs 24x7, I'd find they never actually went off. But I don't really know for sure.
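    Short of pointing a camera at the port LEDs, one way to cross-check those brief down/up events is to poll the link state from the ESXi Shell and keep a timestamped record. A minimal sketch follows. It assumes the table printed by esxcli network nic list has two header lines, with the NIC name in the first column and the link status in the fifth, which you should confirm on your own host. The names link_states and watch_links are hypothetical.

```shell
# Print "<nic> <link status>" for every NIC, reading the table produced
# by `esxcli network nic list` on stdin. Assumes two header lines, with
# Name in column 1 and Link Status in column 5.
link_states() {
    awk 'NR > 2 && NF >= 5 { print $1, $5 }'
}

# Append a timestamped snapshot to a log file every N seconds (default 5).
# Runs until interrupted with Ctrl+C.
watch_links() {
    logfile="$1"
    interval="${2:-5}"
    while true; do
        esxcli network nic list | link_states | while read -r nic state; do
            echo "$(date '+%Y-%m-%d %H:%M:%S') $nic $state"
        done >> "$logfile"
        sleep "$interval"
    done
}

# Example: watch_links /tmp/link-state.log 5
```

    Polling like this can miss an outage shorter than the interval, so it complements the log data rather than replacing it.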


    May 29 2018 Update

    I received the following information about how folks affected by X557 10GbE network drops can get their system repaired via firmware flash done at Supermicro:

    ...the people who are responsible the barebone PM and motherboard PM and they won't provide any other way to apply the fix, unfortunately. This is because of their agreement with Intel, they are not willing to bend that. So any customer interested in having the fix for the network drop issue will have to ship the unit back to them, shipping both ways is still covered by Supermicro.

    This is a disappointing "fix." Given I don't know what they do, and don't have release notes to know if this is any different than what was done to my system, I likely won't be sending in my unit, since I no longer experience outages often enough (less than once a month) for it to be a significant problem for me.

    Please contact Supermicro Support directly to make arrangements. I don't have details about how these are handled outside of North America, but would love to hear how folks get treated who go through this return process via comments left below.


    Jun 12 2018 Update

    Unfortunately, the X557 link-loss problem has returned to my Xeon D-1567 system, happening at a rate of roughly once per day now. I have no idea if applying IPMI 3.68 somehow made this happen a little more often. That's a very long shot. I will be shipping my system in to Supermicro when I have an opportunity to do so.

    One TinkerTry visitor has sent in a new report of his experience with the SDV23B fix. It worked for him, and that is good!

    Unfortunately, he also later received this email:

    I am sorry that our company doesn’t accept end user/personal NDA. You will have to bring the unit back for RMA. Please submit the ticket for RMA online.

    https://www.supermicro.com/support/rma/

    That would seem to indicate that individuals who don't list a company name on the NDA forms may not have a way to fix this issue themselves, remotely.


    Jul 25 2018 Update

    I decided to take Wiredzone and Supermicro up on their offer to flash my X557 firmware at Supermicro. I was provided a UPS ground label, and was without my system from 6/15/2018 to 7/10/2018, that's 25 days in all. Good thing I had a secondary 8 core SYS-5028D-TN4T that I could use in its place while it was away for so long. It's not yet clear what they did to my system, as far as the exact firmware, but I can say that it was still the original motherboard that was returned to me.

    I'm now also testing out newer ESXi drivers named ixgben 1.7.1 for ESXi 6.7, and 1.6.5 for ESXi 6.5; more details here.


    Sep 30 2018 Update

    Unfortunately, today, I received the first report of an 8 core SYS-5028D-TN4T / Xeon D-1541 that experienced a 10GbE port going down. It was on BIOS 2.0 / IPMI 3.68, more details to follow.


    May 23 2020 Update

    Since my last update on Sep 30 2018, I'm not aware of further reports here at TinkerTry about outages on Xeon D-1541.

    Unfortunately, based on numerous reports of E200-8D owners running into similar difficulties, and the never-ending saga, I've felt it time to update the title of this article.

    From:
    Workaround and fix for intermittent Intel X552/X557 10GbE/1GbE network outages on Xeon D-1521/1540 and pre-2018 1567/1587; popular E300-8D/E200-8D 1518/1528 and 1541 never had the issue

    To:
    Workaround and fix for intermittent Intel X552/X557 10GbE/1GbE network link-down outages on Xeon D-1500 Series

    I have also received an incredibly detailed set of instructions in a comment below by blogthis, his entire post captured here below, verbatim:


    blogthis • 2 hours ago • edited
    After literally years of my tinkering with my system, and my scattered posts and replies to now somewhat-buried posts throughout the original articles comments.. it has become a bit unclear where to dig to find a concise "Latest Info" of consolidated findings.

    So.. SUMMARY of my Stable setup this past year, an update all in one place, what I keep getting asked in reply post chains - I hope this helps clarify my fragmented findings as of today, my currently working config:

    Mini-tower SYS-5028 Series:
    SYS-5028D-TN4T-12C (MBD-X10SDV-12C+-WD002 in Chassis SC721TQ-250B)
    Deprecated Prior 10G Firmware (Intel SDV-23A); ESXCLI = 0x800005ad
    Upgraded New 10G Firmware (Intel SDV-23B); ESXCLI = 0x800006b7
    ((update from SuperMicro TW using "SDV23B_UEFI.zip"))
    TPM 2.0 module (AOM-TPM-9665V-S)
    Datacenter Management Package (SFT-DCMS-SINGLE)
    IPMI = v3.86
    BIOS = v2.1
    ESXi = v6.7-Update3b
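    To see which of these two firmware levels a given host reports, the version string can be pulled out of esxcli network nic get. A rough sketch: it assumes that command prints a "Firmware Version:" line, and it only recognizes the two values quoted in the list above (0x800005ad for SDV-23A, 0x800006b7 for SDV-23B). The helper names fw_version and fw_label are hypothetical.

```shell
# Extract the value of the "Firmware Version:" line from
# `esxcli network nic get -n <nic>` output supplied on stdin.
fw_version() {
    awk -F': *' '/Firmware Version/ { print $2; exit }'
}

# Map a reported version to the firmware names used in the list above.
fw_label() {
    case "$1" in
        0x800005ad*) echo "SDV-23A (prior)" ;;
        0x800006b7*) echo "SDV-23B (updated)" ;;
        *)           echo "unknown" ;;
    esac
}

# Example: fw_label "$(esxcli network nic get -n vmnic0 | fw_version)"
```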

    VIB's:
    Disabled Intel "legacy" IGB-1G and IXGB-10G
    Install-Enabled: Intel next-gen IGBN-1G (e.g. for I350 chips):
    ((vmw-esx-6.7.0-igbn-1.4.10-offline_bundle-14160242.zip))
    Install-Enabled: Intel next-gen IXGBEN-10G (e.g. for X557 chips):
    ((vmw-esx-6.7.0-ixgben-1.7.20-offline_bundle-14162871.zip))

    • There are newer drivers available, but these are working for me.

    System:
    HEAT from Qty-2 Enterprise SATA Intel DC S3520 1.6TB SSD (for supercaps).
    HEAT from Enterprise PCIe Intel DC S3520 2TB (AIC for speed/supercaps).
    HEAT from SuperMicro 128G SATA-DOM; FAST BOOT vs USB (SSD-DM128-SMCMVN1).
    HEAT from Qty-2 of 10GBaseT LACP to NetGear XS724EM (firmware 1.0.1.1).

    • This is for Synology DS3617xs SAN/NAS, latency-tolerant big files.

    HEAT from Qty-2 of 1G LACP to Cisco SG-C300-20 (firmware 1.4.11.4).

    • This is for OOB management, and internet needing minimum latency.

    COOLING from high-quality Cat7 cables (avoids higher watt-heat output).
    COOLING from mod adding high-CFM Notura front-inside intake fan.
    COOLING from mod adding SuperMicro air-shroud (MCP-310-00076-0B).
    COOLING from pre-installed CPU fan (only available in Mini-Tower config).

    Important - Intel Spec Sheets vs SuperMicro IMPI (inconsistent!):

    • The SuperMicro MB_10G temp IPMI before alarm is 100C, but seems wrong?
    • Intel Ark spec sheets for 10G X520/X550/X552/X557, all say max 55C..!!

    IMHO..
    Quiet fan settings are only viable if low-load and NOT using 10G:

    Standard fan speed, IPMI MB_10G Temp: 53C
    Hot at ONLY 2C under thermal limit
    Fan1: 900rpm
    Fan2: 3100rpm
    Fan3: 900rpm

    Optimal fan speed, IPMI MB_10G Temp: 52C
    Hot at ONLY 3C under thermal limit
    Fan1: 1100rpm
    Fan2: 3300rpm
    Fan3: 900rpm

    IMHO..
    Reliable fan settings if higher-load OR using 10G:
    ** For clarity, I suggest ONLY use these fan speeds ***

    HeavyIO fan speed:
    MB_10G Temp: 45C-46C (of 55C max)

    Fan1: 2600-2700rpm
    Fan2: 4700-4800rpm
    Fan3: 1400rpm

    Full-Max fan speed:
    MB_10G Temp: 44C (of 55C max)

    Fan1: 3000rpm
    Fan2: 5000rpm
    Fan3: 1700rpm
    (It seems this is best my fans can do, but faster/louder fans are available)

    Summary:

    • To achieve Intel specs under load, for me it needed fw, mods, extra fans, at high-rpm.
    • All this emphasizes my opinion of an overheat root cause to this topic.. not just firmware.

    I hope this helps someone, and I look forward to reading future thoughts.
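    The headroom figures in blogthis's comment above are simple subtraction against the 55C maximum he cites, and the same check is easy to script. This sketch assumes a sensor listing in the pipe-separated style of ipmitool sdr (name | reading | status) that includes a sensor named MB_10G Temp. Both that format and the headroom_10g name are assumptions to adjust for your own tooling.

```shell
# Read sensor lines on stdin, find "MB_10G Temp", and print how many
# degrees C remain below a limit (default 55, the figure cited above).
headroom_10g() {
    limit="${1:-55}"
    awk -F'|' -v limit="$limit" '
        $1 ~ /MB_10G Temp/ {
            temp = $2 + 0          # "53 degrees C" -> 53
            print "MB_10G Temp: " temp "C, headroom: " (limit - temp) "C"
            found = 1
        }
        END { if (!found) exit 1 }
    '
}

# Example, from a machine with ipmitool and network access to the BMC:
#   ipmitool -I lanplus -H <bmc-address> -U <user> sdr | headroom_10g
```

    The function exits non-zero when the sensor is missing, which makes it usable in a cron job or monitoring check.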


    Feb 16 2022 Update

    More stories related to overheating have come in. They're very interesting, so I've added these Disqus comments right into the article to make them much easier to find.

    Matt G • a month ago • edited
    I spent the last few days pulling my hair out after trying to change over which NIC I was using on my Supermicro Xeon D-1518 SYS-E300-8D X10SDV-TP8F based system. Namely it was originally using the gigabit NIC, and since Comcast has upgraded to 1.4gbps on their modems, I wanted to move over to the 10G SFP+ X552 port. However despite rebooting, powering up/down, trying different ixgbe drivers, trying unsupported SFP options, trying 10gb LR, SR, 10BASE-T, NBASE-T, DAC cables, etc. NOTHING worked to achieve link light even though the port itself was responsive to SFP metadata inquiries, detection of SFP insertion, etc. Eventually I just cleaned the dust off inside the case and outside and it worked. Geez....lends itself nicely to the overheating and/or fan speed theories. (I was already on the latest Supermicro BIOS for a long time so it wasn't that nor did I try to manually change any X552 firmware/etc.) Now I can get 1.4G on my Comcast downloads! Yip! Thanks for the page. If my cleanup job doesn't last I may try the "/utils/bin/ipmicfg-linux.x86_64 -fan 1" command to always max out the fan speeds.
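    Matt's closing idea, forcing the fans to full speed, uses Supermicro's IPMICFG utility. As a hedged sketch, the wrapper below maps a mode name to the number passed to -fan. Only -fan 1 comes from the comment above. The other numbers are an assumption based on the mode list IPMICFG usually prints, so run the tool with a bare -fan first to see what your board supports. The path is the one Matt quotes, and set_fan_mode is a hypothetical name.

```shell
# Path as quoted in the comment above; override if yours differs.
IPMICFG="${IPMICFG:-/utils/bin/ipmicfg-linux.x86_64}"

# Hypothetical wrapper: translate a fan mode name into the numeric
# argument for `ipmicfg -fan`. Only "full" (1) is confirmed by the
# comment above; the rest are assumed and should be verified.
set_fan_mode() {
    case "$1" in
        standard) mode=0 ;;
        full)     mode=1 ;;
        optimal)  mode=2 ;;
        heavyio)  mode=4 ;;
        *) echo "unknown fan mode: $1" >&2; return 1 ;;
    esac
    "$IPMICFG" -fan "$mode"
}

# Example: set_fan_mode full
```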

    blogthis RyanCCC • 3 days ago • edited
    Suggest reading my in-depth article about overheating of the 10G onboard chipset. It has been stable ever since I did the two big highlights in my post: (1) add any rear optional fan you can mount into the rear grill to blow air directly onto the 10G chipset to cool it more and YES there is a plug on the motherboard for this optional rear-mounted fan (that fan SuperMicro discontinued but the model of fan I posted fits at a lightly off-angle into the rear-intake grill but works fine).... AND (2) add the SuperMicro internal case "shroud" that is essentially a piece of teflon/plastic with very special cut groove to clip into the rear of the case, and redirects incoming air also into the motherboard and chipset (so obviously SuperMicro from Day-1 has been well aware of its over heat problems on this motherboard model)... NOTE you'll have to trim it with scissors if your using motherboard model with the larger heat-sync-fan on top (such as the 12c model)... I also just posted about the firmware NOT being under NDA anymore with SuperMicro FAQ now pointing to official Intel download of it for this infamous 10G issue. HOWEVER, it's NOT exactly the same FW that you get from SuperMicro (their method was extremely cryptic for EFI update).


    Feb 19 2022 Update

    blogthis RyanCCC • 5 hours ago
    Paul Braren | TinkerTry.com - Wanted to clarify something. The 10G Chipset FW is updated by SuperMicro provided (Intel OEM) file SDV23B_UEFI.zip to ESXCLI = 0x800006b7, and while you can still ask for it from SuperMicro (the OEM), this is seemingly being phased as a standalone solution I'm guessing from the high risk of incorrect install I mentioned. The UEFI Extension-driver (in BIOS) is updated by Intel provided (generic) intel direct-download, which seems to help work-around this issue. And finally, the ixgbe drivers have been deprecated for so long I doubt the latest UEFI extension-drivers are tested with it so I suggest the native (newer) version of ixgben as the replacement. You will need to obtain the ixgben intel 10G VIB (driver) from vmware.com. If unfamiliar with how to disable the older ixgbe if still lingering in your system and enabling the newer-native ixgben successor, this may help: https://kb.vmware.com/s/article/2147565
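    For the driver swap blogthis describes, the general shape on ESXi 6.x is to disable the legacy module, enable the native one, and reboot. Here's a minimal sketch under those assumptions. The module names ixgbe and ixgben come from the comment, use_native_ixgben is a made-up name, and the VMware KB article linked above remains the reference for the exact procedure on your build.

```shell
# Hypothetical helper: prefer the native ixgben driver over legacy ixgbe.
# A reboot is needed afterwards for the change to take effect.
use_native_ixgben() {
    # This first call may report an error if ixgbe is not installed.
    esxcli system module set --enabled=false --module=ixgbe
    esxcli system module set --enabled=true --module=ixgben || return 1
    # List both modules so the enabled/loaded state can be checked.
    esxcli system module list | grep -E 'Name|ixgbe'
}

# Example (ESXi Shell): run use_native_ixgben, then reboot the host.
```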


    See also at TinkerTry

    supermicro-superservers-vcg-updated-to-6-7

    how-to-install-esxi-on-xeon-d-1500-supermicro-superserver

    supermicro-superserver-bios-13-and-ipmi-358-released

    supermicro-superservers-vcg-updated-to-65u1

    vrealize-log-insight-install-configure-syslog-update

    promise-sanlink3-t1-adapter-gives-thunderbolt-3-usb-pc-10gbe

    xeon-d-landscape-2017

    ubiquiti-mpower-pro-8-port-outlet-measures-watts

    a-good-look-at-the-worlds-first-16-core-supermicro-superserver-xeon-d-1587-thanks-canada

    how-to-install-intel-x552-vib-on-esxi-6-on-superserver-5028d-tn4t

    See also

    The-Lone-Sysadmin