Kioxia CD6 Performance
We are going to first go into some unique testing that we had to do before we could even get to results. We are then going to discuss our results. If you read our Kioxia CM6 review, you will have seen this, but we still wanted to call out this work since it led us to create a new comparison set.
PCIe Gen4 NVMe Performance is Different Today
Something that was slightly unexpected, but perhaps should have been, is that PCIe Gen3 performance is not exactly the same on AMD EPYC 7002 “Rome” versus the Intel Xeon Scalable family. It is close, but there were differences. We even went down the path of taking our data generated on 2-socket systems and putting them on a single-socket Intel Xeon platforms. While we could get consistent Intel-to-Intel performance on the same machine, the performance was not as consistent Intel to AMD.
As a result, we realized that we needed to re-test comparison drives on an all AMD EPYC 7002 platform. These deltas are small. In terms of real-world impact, most would consider this completely irrelevant. They are generally within 2% which one can attribute to a test variation, but since that is not a 2% range across the board and more of a +/- 2% range, we had to make the call to not use legacy data for our review since we strive to get consistency. To that end, we had to investigate another aspect of AMD EPYC performance: PCIe placement.
Even within the AMD EPYC 7002 series, one needs to be cognizant of die placement and capabilities. A great example of this is that there are AMD EPYC 7002 Rome CPUs with Half Memory Bandwidth. In that piece we go into the SKUs and why they are designed with fewer memory channels.
It turns out that as we were testing PCIe Gen4 devices on AMD EPYC SKUs, the actual placement of the workload on cores, and therefore an AMD CCD, and the location of PCIe lanes actually matters. As you can see from the diagrams above, the CCD, RAM, and PCIe lanes can be far apart from one another on the massive I/O die (or IOD.) This is a very small impact when we run workloads (<1%), but we could measure it.
Further, we found that in some of our latency testing the 48 core (6 CCD) SKUs such as the AMD EPYC 7642 and lower clock speed SKUs such as the AMD EPYC 7262 were less consistent than the higher clocked and 4x or 8x CCD SKUs. Even with a single PCIe Gen4 device, all AMD EPYC 7002 SKUs are not created equal. This is less pronounced at single drives, as we are testing here, but it becomes a much bigger challenge moving to an array of drives which one can see even using PCIe Gen3 SSDs on EPYC platforms.
The reason this review took so long for us to publish is simply that this took a long time to validate and then decide upon a workaround. Since PCIe mapping to IO dies is not easy to trace on many systems, and we needed an x8 slot for both PCIe Gen4 SSDs as well as a Gen3 era SSD (the PM1725a), we ended up having to build our test setup around a single x16 slot in our Tyan EPYC Rome CPU test system, and then mapping workloads to AMD’s CCDs around that slot. We also used the AMD EPYC 7F52 because that has a full 8x CCD enablement with 256MB L3 cache while also utilizing high clock speeds so we did not end up single-thread limited in our tests.
Again, these are extremely small deltas but very important. They will also, therefore, vary when one looks at Arm players such as Ampere (Altra pictured above), Huawei (Kunpeng 920), Annapurna Labs/ Amazon AWS (Graviton 2), NVIDIA-Mellanox Bluefield, and IBM with Power9 / 10. Soon we will have Ice Lake Generation Intel Xeon CPUs to add to this list as well as AMD’s EPYC 7003 “Milan” generation. The bottom line is that once we move out of a situation where Intel Xeon has 97-98%+ market share, and we have NVMe SSDs, this all matters. With AMD now at over 10% market share, this is quickly becoming important. It is also something that we have been testing for years, including in our Cavium ThunderX2 Review years ago. The difference in PCIe controllers and chip capabilities is well-known in the industry and is something STH has been looking at for years.
A downside of this is that we took a lot of testing then building around a single PCIe Gen4 slot we became serial in our ability to test rather than parallel which was unpleasant when you also cannot use historical data in the comparison. Still, to test the Kioxia CD6 (and previously the CM6) we had to get to this level of detail to have a valid comparison to other drives. Using our Xeon-based PCIe Gen3 test results would have been untenable.
Having a full set of 24x SSDs which is a more common deployment scenario, would mitigate the need for the above testing, but since we were looking at a single drive, this became important, especially with application-level testing.