Workaround and fix for intermittent Intel X552/X557 10GbE/1GbE network link-down outages on Xeon D-1500 Series
If you really want to skip the interesting backstory, jump right down to the symptom, the workaround, the proposed fix, and the actual fix done at Supermicro. This issue never affected the popular Xeon D-1541 systems with 8 cores.
Yeah, it's a bit long, and somewhat complicated. At least I know it will feel good when this is all completely behind us!
I'm glad that there are now hundreds of happy Supermicro SuperServer Bundle owners in the world who by-and-large have greatly enjoyed their product ownership experience, sharing a lot of positive experiences publicly, with under 0.5% of Bundle buyers returning their SuperServer to Wiredzone for any reason. I'm also glad to have had TinkerTry readers play an active role in bringing the even-more-capable 12 core Xeon D-1567 to market in the form of another Wiredzone Bundle. It ships already burn-in-tested and fully warranted, along with the latest tested BIOS and IPMI firmware already installed. This has made for a much-improved out-of-box experience for those eager to get right to work, and once the fix is available for this 10GbE issue, you can bet Wiredzone will quickly add that fix to their standard procedures.
When it comes to 10GbE networking, it turns out the 12 core Xeon D's track record has been a little bumpy. It's quite possible that only a small proportion of those owners actually use their two 10GbE ports for 10G or 1G connectivity. It is with those folks in mind that I write this article, those most enthusiastic Xeon D fans that paid the premium for extra cores and a bigger heatsink/fan, expecting to enjoy nearly linear scaling.
Some of those 12 core owners have unfortunately been experiencing an intermittent problem with network outages, where the physical link-layer LEDs go dark at some random time, for no apparent reason. This unfortunate link-down state currently has only one known recovery method. Shut down whatever OS you're running, then remove power from the system. That's an unacceptable "fix," more like a workaround really. This sort of power cycling can't be done remotely, at least if you don't happen to have a smart power strip already installed between your UPS and your SuperServer.
Months back, I experienced the random disconnects of the 10G NICs. I did the BIOS and ICMP updates along with ESXi 6 U2 and the 4.4.1 x552 driver. Things had been running without issue for months, so I thought all was well. I recently decided to turn up some of my old VMs that I've had powered off. All was well for about 3 days. Now, it appears my 10G disconnecting NIC issue is back. I even just updated to 4.5.1 x552 driver...no luck. The kicker, no matter how many reboots, how many interface shutdowns (on both the switch and esxcli), and even pulling the network cable, nothing would bring up the links. I even tried hard setting speeds, nothing. The only solution, pull power to the X10SDV-12C-TLN4F. The huge problem this is causing me is that all my VM storage is on my synology NAS, via 10g (x540-T2 also hooked up to same Cisco switch, no issues). Once my supermicro 10g Links go down, all the VMs die.
So. My questions: is anyone else experiencing this? Given that it ran fine for months with a few VMs, but came crashing down when I loaded it up, I'm wondering if it's linked to load?
ESXi 6.0.0, 4600944 (u2)
Cisco 3750x, C3KX-NM-10GT
HELP! I wouldn't even know who to talk to, Cisco, Supermicro, VMWare, Intel?
Update to the story. I haven't been able to open a VMWare ticket yet. Hopefully I will be able to through work, if needed.
The good news, however, is that I did open a ticket with SuperMicro, and they responded pretty quickly, and suggested I update the firmware. They provided a specific firmware update for the 10G NICs. I'd pass is along, but it seems pretty specific. The firmware was labeled: SDV23A.
So far, so good. 6+ days and counting. It has a decent load, so I plan to pile on a few more VMs, and keep monitoring.
A few months later, another report. I'll admit I didn't think too much of this second report, since an RMA swap resolved his issue. Without a 10GbE switch hooked up to my own Xeon D-1567 SuperServer Workstation Bundle 1, I didn't have a way to replicate the rarely encountered issue either. But I never forgot about it, and this incident motivated me to wire my home up for 10GbE. So I climbed into my sweltering hot attic in August to get myself some fresh 100' CAT7 cabling strung from my basement's Netgear XS708T ProSAFE 8-Port 10-Gigabit Smart Managed Switch to my 2nd floor via the attic. Why? Well, I use my SuperServer Workstation near sleeping humans. While my Netgear XS708T was quieter than the Ubiquiti ES-16-XG switch I briefly tried (unboxing and testing), this switch was still far noisier than any of my Xeon D servers. That's why the switch stays in my basement, near my Xeon D-1541. At the time, I also hadn't heard about the Netgear XS708T's simple fan swap solution.
Both of your 12 core Xeon D server's Intel X557 10GbE RJ45 port LEDs go dark at some seemingly random interval, a loss of link-layer. All LEDs go dark: the yellow link LED and the green network speed LED. The frequency of these outages ranges from several times per day to once every few months. It can happen with whatever OS you're running, at seemingly random times, regardless of workload.
Anybody with a 12 core Xeon D system using the X557 10GbE ports
It's more complicated than that. To date, this issue seems to happen on:
- Any OS
I have reports of this problem occurring on:
- VMware ESXi 6.0
- VMware ESXi 6.5
- XenServer 7.3
That's not all OSs, but I don't have any reason to believe this doesn't happen on others. I likely just haven't received any reports of it happening on Windows yet.
- Any 10G switch
I have reports of this problem occurring on:
- XS708T ProSAFE 8-Port 10-Gigabit Smart Managed Switch
I reported this issue to Netgear as Case# 28837686. Initially, it was difficult to determine which network element was the root cause of the problem, so I was wondering if Netgear could see anything useful in the logs. Ultimately, there was nothing in the switch that described why the ports went down. See also many 10GbE network cabling problem reports at ServeTheHome here.
- XS716T ProSAFE 16-port 10-Gigabit Smart Managed Switch reported by a Bundle 1 owner via email to me.
- Cisco Catalyst C3KX-NM-10G Network Module reported by Devoid here.
- Any 12 core or 16 core Xeon D
Presumably any brand of Xeon D system with 12 or more cores (there are many!), but I only first heard of this issue on Supermicro 12 core systems. This includes:
- Xeon D-1557 featured on the X10SDV-12C-TLN4F motherboard as reported by Devoid here and discussed by phone recently
- Xeon D-1567 featured on the motherboard (PIO-5028D-TN4T-01-WD002 in Windows Device Manager) that Wiredzone sells as part of the SYS-5028D-TN4T-12C SuperServer Bundle 1 and Bundle 2 system.
- Xeon D-1577, Xeon D-1571, Xeon D-1559 likely affected too, see also the entire Intel Xeon Processor D Family (aka Broadwell DE) on Ark here.
- Any network connection speed, 10GbE, and maybe 1GbE too
- Presumably 1GbE links to the X557 network ports are also prone to this failure, but that's conjecture. Indications so far are that this appears to be a firmware issue with the X557 itself.
I realize this is an odd section title, but when you read the bullet list, you'll start to gain a further understanding of why it has been challenging to get to the bottom of this issue.
- Power cycling the 10GbE network switch
- Upgrading firmware of the 10GbE switch
I only tried this with my Netgear XS708T, it made no difference.
I'm currently at 184.108.40.206, 220.127.116.11 level with 18.104.22.168.4.1.4522.214.171.124 System Object OID.
- Forcing different network negotiation methods in the device driver
I only tried the tweaks that the Intel driver VIBs would allow me to, under VMware ESXi 6.5U1.
- Trying different CAT6a or CAT7 cables
- Trying different cable lengths
- Correlating OS events with network outage events, no obvious pattern after exploring syslog from Netgear and attached VMware ESXi 6.5U1 host, with the configuration of VMware vRealize Log Insight detailed at TinkerTry here.
- Applying the Intel X557 firmware SDV23A using Intel's SDVTLN4.BAT batch file on DOS bootable media fixed the issue for Devoid for a few months, but it didn't work for me. I encountered another outage less than a day after the firmware upgrade, following a few weeks of uptime. I'm really not sure what this means, there's just not enough data yet. It's also possible my firmware update didn't complete successfully.
Why am I unsure about the upgrade? It starts with my article:
How to check network driver and NIC firmware details in VMware ESXi
which I used to find the following information right after the SDV23A upgrade on my Xeon D-1567:
- Xeon D-1567 (TinkerTry home lab 12/26/2017):
Driver Info:
Bus Info: 0000:03:00.0
Driver: ixgbe
Firmware Version: 0x800005ad
Version: 4.5.3-iov
- Xeon D-1541 (TinkerTry home lab 12/26/2017):
Driver Info:
Bus Info: 0000:03:00.0
Driver: ixgbe
Firmware Version: 0x800003e7
Version: 4.5.3-iov
Yes, the firmware versions seem to differ. But do I know that SDV23A is supposed to give me 0x800005ad? Not entirely sure, and it doesn't show anywhere in my archive of all BIOS release notes.
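To make that comparison less error-prone across several hosts, here is a minimal Python sketch that pulls the driver and firmware fields out of text shaped like the `esxcli network nic get` output quoted above. The function names and the sample string are my own illustration, and the exact indentation of real output is an assumption. Note that a numerically larger firmware value is not proof of a newer release; the code only shows whether two systems differ.

```python
def parse_driver_info(text):
    """Collect 'Key: value' pairs from esxcli-style 'Driver Info' output.

    Only the first colon on each line splits key from value, so a value
    such as '0000:03:00.0' is kept whole.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            info[key.strip()] = value.strip()
    return info


def firmware_as_int(value):
    """Convert '0x800001cf' or '0x800001cf, 255.65535.255' to an integer."""
    return int(value.split(",")[0].strip(), 16)


# Sample text modeled on the Xeon D-1567 readout in this article.
SAMPLE_D1567 = """\
Driver Info:
   Bus Info: 0000:03:00.0
   Driver: ixgbe
   Firmware Version: 0x800005ad
   Version: 4.5.3-iov
"""

if __name__ == "__main__":
    info = parse_driver_info(SAMPLE_D1567)
    print(info["Driver"], info["Version"], info["Firmware Version"])
```

On a real host you would feed the function the captured output of `esxcli network nic get -n <vmnic>` instead of the sample string.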
- gracefully shut down your 12 core (or greater) system
- unplug the power cord for at least 15 seconds
- plug the power cord back in
- power up and boot your operating system up
This workaround won't prevent you from losing 10GbE connections, but it will allow an automatic fail-back to 1GbE for those occasions where powering down is very inconvenient.
While you're likely using Intel I350 ETH0 for your service console, you can assign ETH1 to be your standby adapter.
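As a rough sketch of how that standby assignment could be scripted on an ESXi host, the Python helper below only builds the argument list for `esxcli network vswitch standard policy failover set`. The vSwitch name and the vmnic numbers are hypothetical placeholders (I'm assuming a 10GbE uplink on vmnic2 and the 1GbE ETH1 port on vmnic1), so check your own host's uplink names before running anything like it.

```python
def failover_command(vswitch, active, standby):
    """Build an esxcli argv that sets active and standby uplinks on a
    standard vSwitch. Uplink lists are joined with commas, no spaces."""
    return [
        "esxcli", "network", "vswitch", "standard", "policy", "failover", "set",
        "--vswitch-name", vswitch,
        "--active-uplinks", ",".join(active),
        "--standby-uplinks", ",".join(standby),
    ]


if __name__ == "__main__":
    # Hypothetical names: vSwitch0, 10GbE on vmnic2, 1GbE ETH1 on vmnic1.
    print(" ".join(failover_command("vSwitch0", ["vmnic2"], ["vmnic1"])))
```

Printing the command first, rather than executing it, gives you a chance to review it before pasting it into an SSH session.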
- Contact Supermicro's 24-Hour SuperServer Technical Support directly.
- Inform the technician that you're opening a service request for your 12 core Xeon D system because of Intel X557 10GbE networking issues, asking that they provide you with the firmware fix.
- Supermicro might insist you sign an NDA before they can share the fix with you, I've been told.
- If you get "push back," ask the technician to refer to Supermicro Service Record # SM1704244248 that was reported to TinkerTry readers here.
- Supermicro then sends you an Intel utility to flash the two X557 ports on the motherboard, I'm not sure what that level is.
- For customers not willing to sign an NDA, Supermicro has offered to let them ship their system in, and they'll flash the X557 firmware for you. I don't have any confirmed stories of this actually happening though, with most folks electing to just sign the NDA.
A BIOS upgrade that also flashes both the Intel X557 (10G) and Intel I350 (1G) NICs, should it be confirmed that deploying those flash updates (without NDA) resolves these known issues.
When troubleshooting such intermittent problems, it becomes important to find a way to make the problematic system fail fast. In other words, come up with an easy way to cause the problem without having to wait for it to happen naturally. Ideally, anybody could then replicate both the entire system configuration and the problem (network outage), on-demand, at-will. This would give Supermicro a way to more easily recreate the issue themselves, the first step in getting a proper solution that anybody can apply to their own system. This solution would most likely be in the form of a new BIOS version. Meanwhile, 1.2c is the latest BIOS currently available.
As a blogger representing Supermicro owners like myself, I'm very reluctant to sign an NDA, much preferring to focus my energies on helping Supermicro find a solution that helps everybody anyway. For that to happen, these steps are likely needed first:
- A full recreate performed at Supermicro
- A fix is developed, presumably firmware
- Supermicro may need to coordinate with Intel for this
- QA testing of the fix done at Supermicro prior to GA release
This all adds up to time. It will likely be weeks or even months before we have this fixed, and I'm very sorry about this temporary inconvenience. This article should help make that wait a little easier.
- Install BIOS 1.2c and IPMI 3.58.
- Configure the BIOS exactly as shown here.
- Download free hypervisor ESXi 6.5.0a.
- Use iKVM to mount the ISO and install ESXi onto bootable USB media, such as on the readily available Sandisk.
- Once ESXi is configured, allow ssh (detailed in this article) then issue these two lines to download and install the latest Dec 04 2017 build 7388607 (helpful version history found here):
esxcli software profile install -p ESXi-6.5.0-20171204001-standard -d https://hostupdate.vmware.com/software/VUM/PRODUCTION/main/vmw-depot-index.xml
reboot
- As usual with all 6.x builds of ESXi, you’ll notice no 10G drivers are working or even visible: the built-in VMware inbox drivers don’t work with X557, so you need to install the 4.5.3 VIB from here then
- You may find that you now have no 10G connection at the physical level (no link LEDs on, on your 10G switch).
- This 10G network outage problem can be resolved temporarily by shutting down and unplugging all system power for >15 seconds, then powering back up and booting ESXi back up, waiting for it to finish booting so the 10G driver loads and the 10G speed indicator comes back on.
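If you'd rather not watch the switch LEDs while trying to recreate the problem, a small script can flag the link-down state for you. The Python sketch below parses text laid out like `esxcli network nic list` output and returns the ixgbe ports whose link is not up. The column order, the sample rows, and the MAC addresses are assumptions made up for illustration, so compare them against what your own host prints.

```python
def parse_nic_list(text):
    """Parse table rows shaped like `esxcli network nic list` output.

    Assumed columns: Name, PCI Device, Driver, Admin Status, Link Status,
    Speed, Duplex, MAC Address, MTU, Description.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Skip blank lines, the header row, and the dashed separator row.
        if len(fields) < 9 or fields[0] == "Name" or set(fields[0]) == {"-"}:
            continue
        rows.append({
            "name": fields[0],
            "driver": fields[2],
            "admin": fields[3],
            "link": fields[4],
            "speed": fields[5],
        })
    return rows


def links_down(text, driver="ixgbe"):
    """Return the names of NICs using `driver` whose link status is not Up."""
    return [r["name"] for r in parse_nic_list(text)
            if r["driver"] == driver and r["link"].lower() != "up"]


# Made-up sample data for illustration only.
SAMPLE = """\
Name    PCI Device    Driver  Admin Status  Link Status  Speed  Duplex  MAC Address        MTU   Description
------  ------------  ------  ------------  -----------  -----  ------  -----------------  ----  -----------
vmnic0  0000:05:00.0  igb     Up            Up            1000  Full    00:00:00:00:00:01  1500  Intel I350 Gigabit Network Connection
vmnic2  0000:03:00.0  ixgbe   Up            Down             0  Half    00:00:00:00:00:03  1500  Intel Ethernet Connection X552/X557-AT 10GBASE-T
vmnic3  0000:03:00.1  ixgbe   Up            Up           10000  Full    00:00:00:00:00:04  1500  Intel Ethernet Connection X552/X557-AT 10GBASE-T
"""

if __name__ == "__main__":
    print(links_down(SAMPLE))
```

Run on a schedule against real command output, something like this could timestamp the moment a port drops, which is handy evidence for a support ticket.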
The network adapter part #s are seen on page 3 of Intel's document:
You'll find the various products that share the same device driver:
Product Codes: EZX557-AT, EZX557-AT2, and EZX557-AT4
which are also listed by the Device ID that's found at various places in the vSphere GUIs:
Table 1-2 Device ID
X557 Device                                           Vendor ID   Device ID
Intel® Ethernet Connection X557-AT (Single 19x19mm)   8086        0xB4A3
Intel® Ethernet Connection X557-AT2 (Dual 19x19mm)    8086        0xB4C3
Intel® Ethernet Connection X557-AT4 (Quad 25x25mm)    8086        0xB4B3
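If you're matching IDs from a GUI or a PCI listing against the Intel table above, a tiny lookup can save some squinting. This Python sketch simply encodes the three rows as listed; the function name and the idea of passing IDs as hex strings are my own assumptions, and your host may report a different Device ID than these three.

```python
INTEL_VENDOR_ID = 0x8086

# Device IDs copied from the table in this article.
X557_DEVICES = {
    0xB4A3: "Intel Ethernet Connection X557-AT (Single 19x19mm)",
    0xB4C3: "Intel Ethernet Connection X557-AT2 (Dual 19x19mm)",
    0xB4B3: "Intel Ethernet Connection X557-AT4 (Quad 25x25mm)",
}


def describe_device(vendor_id, device_id):
    """Return the X557 product name for hex ID strings, or None if unknown.

    Accepts values with or without a '0x' prefix, in any letter case.
    """
    if int(vendor_id, 16) != INTEL_VENDOR_ID:
        return None
    return X557_DEVICES.get(int(device_id, 16))


if __name__ == "__main__":
    print(describe_device("8086", "0xB4C3"))
```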
I'm working with Supermicro support directly to help get a fix to you, my valued TinkerTry readers who invested heavily in a Xeon D with more than 8 cores and who demand stable 10GbE networking. It's now 10:22pm, and we just finished a long phone call together working through all the details of this article.
The 12 core Xeon D folks who are using the latest X557 firmware that Supermicro provides are reporting that their 10G network drop issues have gone away, and I've spoken to one such individual myself.
Typos and grammar cleaned up. More conversations with Supermicro planned soon, with future progress updates to be posted right here.
FYI, there may also be a small issue for some 12 core Xeon D owners using their Intel I350 1GbE ports under ESXi, with ETH0 and ETH1 assignments occasionally swapped after ESXi 6.0 to 6.5 upgrades. This seems to be easily remedied by installing the supported driver for ESXi, as explained and shown here. It is odd how these 2 network issues only affect 12 core (and maybe 16 core) Xeon D owners. One really has to wonder how this could be, given how similar they are when compared with the 8 core Xeon D-1541.
I suspect Intel is requiring Supermicro to not widely distribute the firmware fix, and/or the DOS tool EEUPDATE that implements the fix. Maybe it's just beta, maybe there are legal restrictions, I don't know for sure, this is just conjecture. It's important to note that for customers not willing to sign an NDA, Supermicro has offered to let them ship their system in, and they'll flash the X557 firmware for you. I don't have any confirmed stories of this actually happening though, with most folks electing to just sign the NDA.
What I'm focused on here is things that differ that could affect the way the BIOS and IPMI are configured. The idea is to see if there's some good reason that the X557 would apparently begin to run into trouble only on systems with more cores.
- Taller Heatsink - This assembly also includes a slightly bigger CPU fan.
- This would seem to only cause Supermicro to slightly tweak the factory default fan speeds for the CPU fan header on the SoC/motherboard, which is very unlikely to change the temperatures of the physical X557 10G interfaces.
- Slightly increased watt burn - Up to roughly 20% extra watts used versus 8 core models with the same clock speeds, and only when handling very heavy workloads.
- Stress would seem to cause the system to come slightly closer to using about a third of the 250 watt power supply that the CSE-721TQ-250B comes with, the pieces that make up the SuperServer SYS-5028D-TN4T bare-bones systems.
- It's hard to see how this would matter, especially since the X557 PHY should handle high ambient temps just fine, see the many ruggedized fanless designs. These network outages seem to happen during periods of inactivity/idle just as often as when the system is under load.
- Slightly lower 2133MHz for up to 128GB of ECC DDR4 - Intel's architectural restrictions mean that only the Xeon D 8 core design allows the system to negotiate 2400MHz DDR4 speeds at POST, as confirmed in the BIOS and explained here and here.
- The 12 core, and all other Xeon D models (4, 6, and 16 core) negotiate 2133MHz speeds, which is normal. You're unlikely to notice this speed difference during normal use, perhaps up to a 5% difference that's likely only revealed with synthetic benchmarks.
- An increased number of cores is a huge advantage for multi-threaded workloads. For pricing/marketplace reasons, Wiredzone cut over to 2400MHz for all SuperServer Bundles a long time ago, in the summer of 2016. I have no regrets in buying my 12 core system, I actively use mine hundreds of hours per month creating nearly all the content and videos here at TinkerTry.
- It doesn't matter if you have 2 (included with Bundles) or 4 memory sticks installed, 2133MHz is the max you'll get if you have anything other than 8 cores in your Xeon D.
- This Intel design restriction seems to be confirmed to be an industry-wide thing, seen on the various specs sheets here.
I'm an optimist about this sort of thing, even issues that drag on as long as this one has. With so many companies making Intel Xeon D systems out there, and many designs enjoying at least 7 years of product life and support, Intel and Supermicro are highly motivated to resolve this issue. Intel has historically had very robust firmware, and their track record of many VMware/Intel X557 driver/VIB releases these past 2+ years demonstrates how active the Xeon D market continues to be.
TinkerTry is All-in!
After weeks of failed X557 issue recreate attempts at Supermicro HQ in San Jose CA, I'm shipping them my very own Xeon D-1567 SYS-5028D-TN4T. It just so happens to likely be the very first such unit ever produced, but that's likely not relevant, as it appears 200 identical 12 core Xeon D-1567 motherboards were made in the same production run for Wiredzone. Since my system encounters the X557 issue within a day or two regardless of the workload, recreate at Supermicro shouldn't take long, and I'm even mailing them my Netgear XS708T switch too, just in case.
How long it takes Supermicro and Intel to develop a proper fix is another matter, but I'm doing all I can to accelerate that process. This isn't a simple matter.
Since my 12 core is my primary workstation & datacenter, to make this loan to Supermicro possible, TinkerTry.com, LLC has now invested in a 2nd identical Supermicro SuperServer Bundle 2 12 core. Once I have my drives and primary Windows 10 VM moved over, and a recreate accomplished on the exact configuration I intend to ship, I'll be able to send the affected system off to California. Special thanks to Wiredzone for helping with accelerated cross-country shipping, and to all my advertisers for making TinkerTry's 3rd Xeon D purchase possible. I stand behind anything I put my name, and reputation, behind.
This new addition to my family will be a huge boon for my testing and staging and content creation, with one Xeon D-1541 and one Xeon D-1567 soon available to test and reboot, at will. Well, at least once my primary workstation returns, of course. Hopefully soon.
I've also collected several Supermicro SM#s for them to investigate, from various TinkerTry readers who have been pitching in.
Xeon D-1540 (8 core) and Xeon D-1587 (16 core) too?
Note that the first-generation Xeon D was called the Xeon D-1540, and there is now one report of the same X557 network-down issue by verdragan in this article's comments. That is an interesting twist, perhaps this will give Supermicro and/or Intel some insight into root cause.
You'll also notice we now have a report from Xeon D-1587 owner takaze, with the article above updated accordingly.
Finally, there's an owner of 6 SuperServer Bundles out there: four 8 core systems and two 12 core systems. He's only experienced these X557 issues on the 12 core systems. It breaks my heart to ask him to avoid 10G for now as a workaround, but I'm confident we'll get this resolved, without him having to sign an NDA or ship his system to Supermicro to update it for him.
Supermicro has been in communications with me about this, and I might not need to ship them my system after all, with many folks now involved. I don't have any significant new developments to share at this time, unfortunately.
I've also reached out to an Intel spokesperson for assistance.
Preparing my system for shipment to Supermicro for recreate has proven to be much more challenging than anticipated. The problem is no longer happening on a daily basis, due to factors I haven't yet figured out. I will keep posting details on my tests right here.
During 3 weeks of heavy 10GbE testing on another Xeon D-1567 SuperServer (Bundle 2), the issue did not surface. Admittedly, I don't really know why, but using syslogging to my vRealize Log Insight, I'm able to confirm that I've had zero outage incidents on that system, even though it was literally running the same OS, which is still ESXi 6.5U1 on USB, moved over to the temporary, new system.
In a way this is good, and it could explain why Wiredzone has had less than a handful of reports of this problem, with most of those coming from comments on this article. I still don't know why this issue happens on only some 12 core Xeon D systems, and not others.
I managed to convince Supermicro to perform my X557 firmware flash remotely, as a pilot effort of sorts. This saved on shipping costs, and side-stepped the need to sign any NDAs.
First, I carefully and temporarily exposed my IPMI interface's IP address on my problematic Xeon D-1567 to Supermicro support on a new, public IP. They were then able to access my system's IPMI (only) over https, using a non-default very long password. This allowed their technician to flash my X557 to firmware level SDV23B. Gladly, for me, this immediately resolved the issue. Completely gone. Not a single incident of my 10GbE network ports going down again in the last 16 days of careful 24x7 monitoring, with that happy news shared back to Supermicro too, of course.
The open question now is what to do about handling the other customers who are still on SDV23A, along with the one report of a customer on SDV23B but still having outages. For folks eager to avoid shipping charges and doing without their system for a while, who are also unwilling to sign the NDA, perhaps something like this procedure could be workable:
- contact Supermicro 24-Hour SuperServer Technical Support
- ask for the SDV23B fix for your Supermicro 12 or 16 core Xeon D system or motherboard, tell them your serial #
- come up with a mutually agreeable date and time for the upgrade
- prepare for the upgrade by temporarily removing/detaching all data drives
- put the IPMI IP address into the router's DMZ
- change the admin account's password to something long and complex
- inform the Supermicro technician that your system is ready for the SDV23B upgrade at https://yourpublicipaddress (obtained from something like asking Google "what is my ip") and password longcomplexpassword
- once informed the upgrade is complete, continue with the following clean-up steps
- take your IPMI interface out of the DMZ
- unplug power from the SuperServer for the changes to take effect
- insert/reattach the drives
- power up, watch the OS finish booting, see the green 10GbE LEDs illuminate
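For the "long and complex" password step in the list above, here's one hedged way to generate a throwaway password in Python using the standard `secrets` module. The default length of 20 and the letters-plus-digits alphabet are my assumptions, since I don't know what limits your particular IPMI firmware enforces, so adjust both to whatever it accepts.

```python
import secrets
import string

# Letters and digits only, to sidestep symbols an IPMI web form may reject.
ALPHABET = string.ascii_letters + string.digits


def make_password(length=20):
    """Return a random password drawn from ALPHABET using a
    cryptographically secure random source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(make_password())
```

Since the password is shared with a technician, change it again once your IPMI interface is back out of the DMZ.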
I'm having little luck with my multiple attempts to reach out to Intel and Supermicro in the past 2 weeks, but I'm continuing to work closely with Wiredzone, who continue to be very helpful each and every step of the way. I will continue to inform my readers of progress right here, in this same article.
Outages are much more rare with SDV23B
Unfortunately, after 31 days of no recurrence of this issue in my home lab, it happened again. I have reached out to Supermicro for next steps, but have not heard back from them yet. This is disappointing, and confirms what Devoid at TinkerTry here had previously reported.
I now have another report of a user experiencing at least 1 10G network outage per day on the original Xeon D-1500 that existed at launch: the Xeon D-1540. It's an 8 core, and it too has the network outage issue.
I'm looking into the firmware that's available straight from Intel here:
Intel® Ethernet Connections Boot Utility, Preboot Images, and EFI Drivers
Version: 23.1 (Latest) Date: 2/21/2018
Note that they fail to mention the X552/X557 chip that’s in Xeon D-1500 systems, but it does mention many other NICs in the same driver family.
Note that this does mention the X552/X557 chip that’s in Xeon D-1500 systems.
The X552/X557 shares the same drivers as the popular Intel X540 PCIe NIC, which I've used in my home lab on a Sandy Bridge Core i7 system with no outages for many months.
Here’s the relevant section of the output of
esxcli network nic get -n vmnic3
for each Xeon D system listed below, all running BIOS 1.2c.
Xeon D-1540 - had daily outages, now weekly
Supermicro SuperServer SYS-5018-FN4T with NIC firmware 22.9 (2017-11-03) from Intel
Bus Info: 0000:03:00.1
Firmware Version: 0x800001cf, 255.65535.255
Bus Info: 0000:03:00.1
Firmware Version: 0x800003e7
- Xeon D-1567 - had daily outages, now monthly
my system updated to SDV23B, a Bundle 1 Supermicro SuperServer SYS-5028D-TN4T 12 core
Bus Info: 0000:03:00.1
Firmware Version: 0x800006b7
I've been told today that all new Xeon D-1567 systems that Supermicro ships from San Jose CA to Wiredzone for resale already have the Intel X557 firmware fixes. I can also add that a newer Xeon D-1567 system that I had on loan back in January had zero incidences of 10GbE outages. The test period was 3 weeks of heavy testing as my temporary primary workstation, with careful syslog monitoring. That story should provide you with an additional level of reassurance that the issue is unlikely to be encountered by Wiredzone customers with Xeon D-1567 SuperServer Bundles delivered any time this year.
Finally, I have two new 4 core Xeon D-1521 stories to share. These are great examples of folks helping each other out, and seem to help confirm that even with BIOS 1.3, folks are still having some X557 networking issues. It's apparently not just Xeon D owners with 12 or more cores, and it's not just folks using 10GbE switches either. This is a 1GbE networking story, on a SuperServer that only has two X557 ports, and no 1GbE ports.
Of course, I've shared these new stories with Supermicro as well, as it's a significant new spin on a 13 month old story, with a suggested work-around for VMware users using 1GbE switches, highlighted below.
The first story comes from Bruno Zeidan, in his comment at TinkerTry here (excerpts):
Nice post. Although, I have an issue with ESXi 6.7 on SuperMicro X10SDV-4C-TLN2F. NICs are recognized, but they are not getting link status updates or link is not going up. This is Xeon D-1521 which only has 2x 10GE network interfaces. Therefore, major issue as I don't have other means of network connectivity.
I've been waiting for ESXi native support for this card, but in fact, after implementing 6.7, now the network link status are not detected. (keep saying Disconnected).
I'm using Gigabit links although these are 10GE interfaces (it works and is supported).
Did you test the 10GE interfaces? Do they work in your case?
I was on BIOS 1.3 already. Actually, I managed to make it work, on ESX CLI, I did:
esxcli network nic down -n vmnic0
esxcli network nic up -n vmnic0
But, it didn't survive to reboot. Every reboot, I had to do the same. Also installed latest driver ixgbe 4.5.3. But still not working.
To fix the issue, solution was to set the speed manually for both NICs. Finally fixed.
esxcli network nic set --speed 1000 --duplex full -n vmnic0
esxcli network nic set --speed 1000 --duplex full -n vmnic1
My motherboard only came with two 10gbe X557, and I don’t have any gigabit Ethernet. This has made things a bit challenging, but I’m able to run commands via the console using IPMI. I typed the model wrong, as it’s actually a X10SDV-4C-TLN2F (Xeon D-1521).
I had tried reinstalling the VIB from Intel, but it still doesn’t work. ESXi recognizes the network adapters, but they’re reported as “down” all of the time. The cable is connected and the light is on.
I have tried simply re-installing 6.7, following your guide for a fresh install, wondering if my specific installation was corrupted... that again did not fix the issue.
I’m starting to wonder if there’s a bug preventing X557 network cards to work with 6.7 as management NICs? Although they are certified by VMWare to work, and the VIB lists 6.7 as supported.
Paul, thanks for sharing the last link, it seems exactly the same issue I had. Sorry for not noticing it earlier.
I have the same hardware Brian has, and, like him, I'm using gigabit network switches. Changing the configuration so speed is set to 1000 seems to have fixed the issue (and it survived a reboot). I had to do it using the ESXi Shell via IPMI (Alt + F1), which is quite awkward but worked. Sounds like it's a real bug.
Now, I just need to re-configure ESXi based on my docs, since I had to do a fresh-install (ouch).
PS: I did take the power off (and kept it off for a bit) twice, and it didn't work. I am already on the last BIOS (1.3) too.
Based on the changed nature of this issue, and some discouraging news I've just received about self-service fix options, I've had to update the title from:
- Temporary workaround to recover from intermittent Intel X552/X557 10GbE network outages on 12 and 16 core Xeon D, hoping for a public firmware update
to:
- Workaround and fix for intermittent Intel X552/X557 10GbE/1GbE network outages on Xeon D-1521/1540 and pre-2018 1567/1587; popular E300-8D/E200-8D 1518/1528 and 1541 never had the issue
When I first wrote this article, I could never have known that this would turn out to be the right title. The benefits of hindsight and valued TinkerTry feedback have been considerable. I'm very thankful!
For folks not interested in signing an NDA to do the firmware update themselves, I'm still communicating with Wiredzone and Supermicro to get clarification on the potential for an RMA process for Wiredzone customers, where the firmware gets flashed for you. I'll append this article with further information, once it's received.
As for me, with my October-2016-vintage Xeon D-1567 that's in my world's first TinkerTry'd Bundle 1 system, Supermicro performed the Intel X557 firmware patch a few months back. My recurrence of this issue (10GbE goes down and stays down) has been reduced to roughly once every other month. VMware's vRealize Log Insight has shown some brief (1 to 10 second) down/up event pairings, usually of both 10GbE ports, a few times a day. I've actually never noticed these events while using the system heavily as this is my primary workstation. I suspect they might actually be spurious false-positives, and if I aimed a camera at the network port LEDs 24x7, I suspect they actually never went off. But I don't really know for sure.
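To separate those brief down/up pairings from the real "down and stays down" outages, the events can be paired up and timed. The Python sketch below assumes a simplified, made-up line format of `timestamp nic state`; real ESXi syslog or Log Insight exports look different and would need their own parsing, so treat this purely as an illustration of the pairing logic.

```python
from datetime import datetime


def outage_durations(lines):
    """Pair each 'down' event with the next 'up' event for the same NIC.

    Returns (outages, still_down): outages is a list of
    (nic, down_time, seconds_down); still_down maps NICs that never
    came back up to the time they went down.
    """
    down_since = {}
    outages = []
    for line in lines:
        stamp, nic, state = line.split()
        when = datetime.fromisoformat(stamp)
        if state == "down":
            # Keep the earliest 'down' if several arrive in a row.
            down_since.setdefault(nic, when)
        elif state == "up" and nic in down_since:
            start = down_since.pop(nic)
            outages.append((nic, start, (when - start).total_seconds()))
    return outages, down_since


# Hypothetical events in the assumed 'timestamp nic state' format.
SAMPLE_EVENTS = [
    "2018-05-01T10:00:00 vmnic2 down",
    "2018-05-01T10:00:04 vmnic2 up",
    "2018-05-02T03:00:00 vmnic3 down",
]

if __name__ == "__main__":
    outages, still_down = outage_durations(SAMPLE_EVENTS)
    for nic, start, seconds in outages:
        kind = "brief blip" if seconds <= 10 else "outage"
        print(nic, start.isoformat(), seconds, kind)
    for nic, start in still_down.items():
        print(nic, "down since", start.isoformat())
```

Anything left in the second return value is a port that went down and never reported coming back, which is the case that needs the power-cord treatment.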
I received the following information about how folks affected by X557 10GbE network drops can get their system repaired via firmware flash done at Supermicro:
...the people who are responsible the barebone PM and motherboard PM and they won't provide any other way to apply the fix, unfortunately. This is because of their agreement with Intel, they are not willing to bend that. So any customer interested in having the fix for the network drop issue will have to ship the unit back to them, shipping both ways is still covered by Supermicro.
This is a disappointing "fix." Given I don't know what they do, and don't have release notes to know if this is any different than what was done to my system, I likely won't be sending in my unit, since I no longer experience outages often enough (less than once a month) for it to be a significant problem for me.
Please contact Supermicro Support directly to make arrangements. I don't have details about how these are handled outside of North America, but would love to hear how folks get treated who go through this return process via comments left below.
Unfortunately, the X557 link-loss problem has returned to my Xeon D-1567 system, happening at a rate of roughly once per day now. I have no idea if applying IPMI 3.68 somehow made this happen a little more often. That's a very long shot. I will be shipping my system in to Supermicro when I have an opportunity to do so.
One TinkerTry visitor has sent in a new report of his experience with the SDV23B fix: it worked for him, and that is good!
Unfortunately, he also later received this email:
I am sorry that our company doesn’t accept end user/personal NDA. You will have to bring the unit back for RMA. Please submit the ticket for RMA online.
That would seem to indicate that individuals who don't list a company name on the NDA forms may not have a way to apply this fix themselves, remotely.
I decided to take Wiredzone and Supermicro up on their offer to flash my X557 firmware at Supermicro. I was provided a UPS ground label, and was without my system from 6/15/2018 to 7/10/2018, that's 25 days in all. Good thing I had a secondary 8 core SYS-5028D-TN4T that I could use in its place while it was away for so long. It's not yet clear what they did to my system, as far as the exact firmware, but I can say that it was still the original motherboard that was returned to me.
I'm now also testing out newer ESXi drivers named ixgben 1.7.1 for ESXi 6.7, and 1.6.5 for ESXi 6.5, more details here.
Unfortunately, today, I received the first report of an 8 core SYS-5028D-TN4T / Xeon D-1541 that experienced a 10GbE port going down. It was on BIOS 2.0 / IPMI 3.68, more details to follow.
Since my last update on Sep 30 2018, I'm not aware of further reports here at TinkerTry about outages on Xeon D-1541.
Unfortunately, based on numerous reports of E200-8D owners running into similar difficulties, and the never-ending saga, I've felt it was time to update the title of this article from:
- Workaround and fix for intermittent Intel X552/X557 10GbE/1GbE network outages on Xeon D-1521/1540 and pre-2018 1567/1587; popular E300-8D/E200-8D 1518/1528 and 1541 never had the issue
to:
- Workaround and fix for intermittent Intel X552/X557 10GbE/1GbE network link-down outages on Xeon D-1500 Series
I have also received an incredibly detailed set of instructions in a comment below by blogthis, his entire post captured here below, verbatim:
After literally years of my tinkering with my system, and my scattered posts and replies to now somewhat-buried posts throughout the original articles comments.. it has become a bit unclear where to dig to find a concise "Latest Info" of consolidated findings.
So.. SUMMARY of my Stable setup this past year, an update all in one place, what I keep getting asked in reply post chains - I hope this helps clarify my fragmented findings as of today, my currently working config:
Mini-tower SYS-5028 Series:
SYS-5028D-TN4T-12C (MBD-X10SDV-12C+-WD002 in Chassis SC721TQ-250B)
Deprecated Prior 10G Firmware (Intel SDV-23A); ESXCLI = 0x800005ad
Upgraded New 10G Firmware (Intel SDV-23B); ESXCLI = 0x800006b7
((update from SuperMicro TW using "SDV23B_UEFI.zip"))
TPM 2.0 module (AOM-TPM-9665V-S)
Datacenter Management Package (SFT-DCMS-SINGLE)
IPMI = v3.86
BIOS = v2.1
ESXi = v6.7-Update3b
Disabled Intel "legacy" IGB-1G and IXGB-10G
Install-Enabled: Intel next-gen IGBN-1G (e.g. for I350 chips):
Install-Enabled: Intel next-gen IXGBEN-10G (e.g. for X557 chips):
- There are newer drivers available, but these are working for me.
HEAT from Qty-2 Enterprise SATA Intel DC S3520 1.6TB SSD (for supercaps).
HEAT from Enterprise PCIe Intel DC S3520 2TB (AIC for speed/supercaps).
HEAT from SuperMicro 128G SATA-DOM; FAST BOOT vs USB (SSD-DM128-SMCMVN1).
HEAT from Qty-2 of 10GBaseT LACP to NetGear XS724EM (firmware 126.96.36.199).
- This is for Synology DS3617xs SAN/NAS, latency-tolerant big files.
HEAT from Qty-2 of 1G LACP to Cisco SG-C300-20 (firmware 188.8.131.52).
- This is for OOB management, and internet needing minimum latency.
COOLING from high-quality Cat7 cables (avoids higher watt-heat output).
COOLING from mod adding high-CFM Notura front-inside intake fan.
COOLING from mod adding SuperMicro air-shroud (MCP-310-00076-0B).
COOLING from pre-installed CPU fan (only available in Mini-Tower config).
Important - Intel Spec Sheets vs SuperMicro IMPI (inconsistent!):
- The SuperMicro MB_10G temp IPMI before alarm is 100C, but seems wrong?
- Intel Ark spec sheets for 10G X520/X550/X552/X557, all say max 55C..!!
Quiet fan settings are only viable if low-load and NOT using 10G:
Standard fan speed, IPMI MB_10G Temp: 53C
Hot at ONLY 2C under thermal limit
Optimal fan speed, IPMI MB_10G Temp: 52C
Hot at ONLY 3C under thermal limit
Reliable fan settings if higher-load OR using 10G:
** For clarity, I suggest ONLY use these fan speeds
HeavyIO fan speed:
MB_10G Temp: 45C-46C (of 55C max)
Full-Max fan speed:
MB_10G Temp: 44C (of 55C max)
(It seems this is best my fans can do, but faster/louder fans are available)
- To achieve Intel specs under load, for me it needed fw, mods, extra fans, at high-rpm.
- All this emphasizes my opinion of an overheat root cause to this topic.. not just firmware.
I hope this helps someone, and I look forward to reading future thoughts.
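The thermal margins in blogthis's comment are simple subtraction, and it can help to see them side by side. Below is a small Python sketch that uses only the MB_10G readings and the 55C maximum he quotes. The 5C minimum margin is my own illustrative assumption, not an Intel or Supermicro figure, and for HeavyIO I've used the worse end (46C) of his 45C-46C range.

```python
INTEL_MAX_C = 55   # maximum quoted in the comment above from Intel's spec sheets
MIN_MARGIN_C = 5   # assumed safety margin, purely illustrative

# MB_10G temperatures reported in the comment, per IPMI fan mode
readings_c = {
    "Standard": 53,
    "Optimal": 52,
    "HeavyIO": 46,   # worse end of the reported 45C-46C range
    "Full": 44,
}


def headroom(temp_c: int, limit_c: int = INTEL_MAX_C) -> int:
    """Degrees C remaining before the reading reaches the limit."""
    return limit_c - temp_c


def modes_with_margin(readings: dict, min_margin_c: int = MIN_MARGIN_C) -> list:
    """Fan modes whose headroom meets or exceeds the chosen margin."""
    return [mode for mode, temp in readings.items() if headroom(temp) >= min_margin_c]


for mode, temp in readings_c.items():
    print(f"{mode:<9} {temp}C  headroom {headroom(temp)}C")
print("Modes with at least", MIN_MARGIN_C, "C of margin:", modes_with_margin(readings_c))
```

With that assumed margin, only the HeavyIO and Full modes qualify, which lines up with the fan speeds he suggests for anyone using 10G.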
- Supermicro SuperServer Xeon D-1500 Bundle mini-tower and 1U rack mount are already on the VMware Compatibility Guide for ESXi 6.7
Apr 19 2018
- Supermicro Xeon D SuperServer BIOS 1.3 released pretty much just for Spectre mitigation / IPMI 3.58 is still the latest
Mar 27 2018
- Supermicro SuperServer Xeon D-1500 Bundle mini-tower and 1U rack mount are finally on the VMware Compatibility Guide for ESXi 6.5U1
Mar 15 2018
- VMUG Advantage EVALExperience includes latest VMware vRealize Log Insight 4.5 syslog server appliance for easy vSphere, vSAN, IoT, and networking gear log file analysis
Oct 16 2017
- 10G networking for your Laptop - Promise SANLink3 T1 NBaseT Adapter blesses your Thunderbolt 3/USB-C desktop or laptop with 1.0/2.5/5.0/10GbE speeds
Jul 30 2017
- First look at Ubiquiti mPower Pro power strip, home lab pricing, enterprise features, uncertain future
Aug 17 2016
- Nice Canadian gives tour of world's first 16 core Intel Xeon D-1587 Supermicro SuperServer Mini-Tower
May 18 2016
- How to download and install the Intel Xeon D 10GbE X552/X557 driver/VIB for VMware ESXi 6.x, works with the X540 PCIe card too
Nov 02 2015
- Intel X710 NICs Are Crap
Feb 28 2018 by Bob Plankers at The Lone Sysadmin