Custom Firmware for Mellanox OEM InfiniBand Cards – RDMA in Windows Server 2012
This article walks you through building and flashing custom firmware onto a re-labeled Mellanox InfiniBand card from Dell, Sun, or HP. Users of OEM Mellanox InfiniBand cards commonly find that their vendor does not offer the firmware revisions that enable RDMA, and the performance difference is significant: with the custom firmware, you finally get RDMA in Windows Server 2012, pushing file sharing throughput to 3,280MB/s and nearly 250,000 IOPS.
What, no RDMA from my Infiniband Card in Windows Server 2012?
Seeking better file sharing performance, you install a Dell, Sun, or HP branded 40 Gigabit InfiniBand card in your Windows Server 2012 machine – one of the cards based on Mellanox ConnectX-2 hardware. The latest Microsoft OS includes a built-in IPoIB driver for these cards, so it looks like you are ready to go after a reboot. You assign an IP address, configure your shares, and then test throughput. You run an IOMeter benchmark from a client machine against your file server and see results like the screenshot below – fast, but not fast enough.
The test result of 1,958MB/s means that your 40 Gigabit card is delivering only around 15 Gigabits per second. So what is happening to all of that bandwidth you paid for? To diagnose, you break out a PowerShell window. A quick Get-NetOffloadGlobalSetting shows that NetworkDirect is enabled, which means you can use RDMA – if your card supports it.
Running Get-NetAdapterRdma shows that the card itself is configured to use RDMA. So why isn’t it working?
Even with a properly configured system – and Windows Server 2012 is properly configured by default – you won’t actually get RDMA if the firmware on your card doesn’t play well with the OS. To complete your diagnosis, the critical PowerShell query is Get-SmbServerNetworkInterface, which (see below) clearly shows that our InfiniBand card is not capable of IPoIB RDMA with Windows Server 2012. There is more detail in the Windows logs, but we don’t need it; we already know what’s going on.
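For reference, the diagnostic sequence described above condenses to three cmdlets. These names are as shipped in Windows Server 2012; the comments describe what you should expect to see on a card still running the stock OEM firmware:

```powershell
# Run in an elevated PowerShell window on the file server
Get-NetOffloadGlobalSetting        # NetworkDirect should show Enabled
Get-NetAdapterRdma                 # the InfiniBand adapter should show Enabled : True
Get-SmbServerNetworkInterface      # RDMA Capable reads False until the firmware supports it
```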
Stale Firmware is the Problem
While our card promises RDMA on the spec sheet, it turns out that you need Mellanox firmware version 2.9.8350 (or greater) in order to use it on Windows Server 2012. Jose Barreto has the details right here. You can verify the firmware version for your card in a number of ways. Perhaps the easiest is the Windows device manager, as shown in the screenshot below, which shows a card with firmware version 2.9.1000 that does not support RDMA.
Can’t we just download a Firmware updater?
Mellanox provides (as of this writing) firmware version 2.10.720 in its Windows 2012 installer, but that installer will not update third-party re-labeled cards. The latest firmware version available from Dell and HP is 2.9.1000, which does not support RDMA. I have a few Sun cards as well, and they arrived running version 2.7.8130. Without updated firmware, we cannot use RDMA, but the vendors have not (so far) updated their installers.
Custom Firmware is the Solution
Fortunately, there is a solution: create and burn your own firmware. It is actually much easier than it sounds. Your first time may take 30 minutes; after that, it is a two-minute procedure at most. We will begin with the InfiniBand card installed, using the built-in Microsoft driver, and with IP address information configured, as above.
Steps to Create and Burn firmware version 2.10.720:
- Install Mellanox WinMFT
- Retrieve card Device ID
- Retrieve card Board ID
- Download the .mlx file
- Download the .ini file
- Create and Burn new firmware using mlxburn
Firmware Burning Steps in Detail:
1) Install the Mellanox WinMFT software package. This gives us the tools we need in order to create and flash the firmware. At the time of writing, the latest version is 2.7.2 and the installer is named WinMFT_x64_2_7_2.msi.
2) Now we need to retrieve some information from your card. At a Windows command prompt, run mst status to retrieve the PCI ID of your card. For my card, the ID is mt26428_pci_cr0, as shown below; yours will likely be the same unless you have multiple cards installed. Incidentally, the number 26428 is the Device ID (a product identifier) for the Dell mezzanine card. You may notice that it is the same Device ID as some of the Sun and HP cards and as the Mellanox ConnectX-2 dual-port QDR card – an indication that our Dell card is indeed a standard Mellanox product, albeit with Dell-specific firmware.
3) Now that you know your card’s PCI ID, we need to verify a few other card attributes. At the same command prompt, run the command flint -d <PCI ID> query (in most cases that will be flint -d mt26428_pci_cr0 query) and note your card’s Board ID. In the screenshot below, we see a Dell card with a Board ID of DEL09A0000009 (ignore the parentheses). Some of the Sun cards have Board ID SUN0170000009.
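The same information is available as plain text: the Board ID appears in the PSID line of the flint output (the exact surrounding fields vary by card and firmware revision):

```shell
flint -d mt26428_pci_cr0 query
# Look for a line of the form:
#   PSID:            DEL09A0000009
```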
4) Download the raw firmware file to a folder on your Infiniband server. The raw firmware file is a large text file with a .mlx extension. I used a version 2.10.720 firmware file named fw-ConnectX2-rel.mlx that I retrieved from the Mellanox 4.2 driver installer for Windows 2012. You can download my firmware copy from right here. If you don’t want to download my version, you can extract your own firmware file from the Mellanox installer. Start the installer, leave it running, and look in the folder c:\users\<username>\appdata\local\temp for a file with the .mlx extension (thanks to ServeTheHome member seang86s for the tip).
5) Download the .ini file as well, and place it in the same folder as the .mlx firmware file. My version for the Dell PowerEdge C6100 mezzanine card is right here. The .ini file must match the Board ID, so you’ll need to name it DEL09A0000009.ini. Skip to step six if you downloaded my version. If you don’t want to use my file, you can create your own by editing and renaming the .ini file for the equivalent non-relabeled Mellanox board. With the Mellanox installer still running, find the file named MHQH29C_A1-A2.ini in one of the c:\users\<username>\appdata\local\temp folders. Edit the contents so that (for a Dell card) the attribute Name = DEL09A0000009 and PSID = DEL09A0000009, and then change the file name to DEL09A0000009.ini. If you have a Sun ConnectX-2 based card, just replace the Dell Board ID with the Sun Board ID in both the file attributes and the file name.
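If you are building your own .ini rather than downloading mine, the edit-and-rename in step five can be scripted. This is only a sketch: it assumes a Unix-style shell (for example Git Bash on Windows), and it uses stand-in file contents, since the real Mellanox .ini carries many more attributes than the two we need to change.

```shell
# Sketch of the step-5 edit using stand-in contents. Only the Name and
# PSID attributes need to change; everything else is left untouched.
BOARD=DEL09A0000009   # use SUN0170000009 for a Sun-branded card

# Stand-in for the MHQH29C_A1-A2.ini pulled from the installer temp folder
printf 'Name = MHQH29C_A1-A2\nPSID = MHQH29C_A1-A2\n' > MHQH29C_A1-A2.ini

# Rewrite Name and PSID to the Board ID, saving under the matching file name
sed -e "s/^Name *=.*/Name = ${BOARD}/" \
    -e "s/^PSID *=.*/PSID = ${BOARD}/" \
    MHQH29C_A1-A2.ini > "${BOARD}.ini"
```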
6) You are now ready to create the new firmware image and burn it to your card. These two steps take place with one command, and must be done on the server with your Infiniband card installed. Open a Windows prompt and navigate to the directory containing your downloaded files. From the prompt, enter the command mlxburn.exe -dev <PCI ID> -fw <firmware file path>, which in our case is mlxburn.exe -dev mt26428_pci_cr0 -fw fw-ConnectX2-rel.mlx.
When you run the command, mlxburn queries your card to find the Board ID. It then looks in your folder for a .ini file with that Board ID – and finds it since we created one. Mlxburn then uses both the .mlx firmware file and the .ini file (along with additional information from the card, I suspect) to create a firmware image and then burns the firmware to the card. When complete, reboot the server to make the new firmware take effect.
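Put together, the procedure from step two onward condenses to four commands; substitute your own PCI ID and firmware file name where they differ from this Dell example:

```shell
mst status                                         # step 2: note the PCI ID (here mt26428_pci_cr0)
flint -d mt26428_pci_cr0 query                     # step 3: note the Board ID (PSID)
mlxburn.exe -dev mt26428_pci_cr0 -fw fw-ConnectX2-rel.mlx   # step 6: build and burn
shutdown /r /t 0                                   # reboot so the new firmware takes effect
```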
Check Your Work
To verify the new firmware version, open a Windows command prompt and run the command flint -d <PCI ID> query (flint -d mt26428_pci_cr0 query in our example). Check the firmware version to verify that it is 2.10.720, or whichever version of .mlx file you used.
Now verify that your card is RDMA capable. Open a PowerShell window and enter Get-SmbServerNetworkInterface. As in the screenshot below, your InfiniBand ports should now show as RDMA capable.
Finally, run another throughput test. This time we see a far more satisfying 3,279MB/s – 25.6 Gigabits of actual file sharing throughput.
Even more impressive is the incredibly low latency that IPoIB with RDMA gives us. An IOMeter test with 4kb random transfers shows just 0.51ms average latency and nearly 250,000 4kb random IOPS – from a circa 2009 Windows file server!
The file server for this article was a Dell PowerEdge C6100 XS23-TY3 node with dual Intel Xeon L5520 CPUs and a Dell QDR InfiniBand mezzanine card (under $200 here). I used a fresh Windows 2012 server installation with all default settings. For testing, I needed a file system faster than an InfiniBand card – which isn’t easy. To achieve this, I ran the free StarWind RAM disk software on the file server and configured four 8GB RAM disk volumes. When tested locally with IOMeter, these RAM disks were capable of over 9GB/Second of throughput, which is more than enough to keep up with a single InfiniBand card. These ultra-fast disk volumes were then configured as standard Windows shares.
The test client was another identical Dell C6100 node, connected to the server with a Mellanox Grid Director 4036 Infiniband switch. After mounting the four shared volumes across the IPoIB network, I ran the throughput and IOPS tests using IOMeter running on the client node. IOMeter was configured with four workers. The Access Specification for the throughput test was: 1MB transfers, 100% random and 100% reads, with all other settings left at their default values. For the IOPS test, the transfer size was 4kb. I used a test file size of 16,000,000 sectors on each of the four disks and a queue depth of 32. In IOMeter, you can test across multiple volumes by control-clicking on them to multi-select. All other settings and configurations were left at their defaults. For example, the Windows firewall was left running and large pages were not enabled.