Traditional “Four Corners” Testing
Our first test was to see sequential transfer rates and 4K random IOPS performance for the Kioxia CD6-L 7.68TB SSD. Please excuse the smaller-than-normal comparison set. If you need an explanation, see above for why we are not using legacy Xeon Scalable platform results.
Here we can see what Kioxia means by the CD6-L being a “read optimized” SSD. We actually get very good sequential performance. The one area of four corners testing that seems to be lower is 4K random write, which makes sense given the target market. At the same time, we are getting better performance than we got on the original CD6.
Taking a quick look at the CD6-L on the Ampere Altra Q80 platform, as a representation of Arm CPU performance, we get fairly good performance relative to AMD EPYC 7002.
A few quick notes here. First, we were going to test these with the IBM POWER9 systems we have in the lab. Our test results looked distinctly wrong, so we are not publishing them here. The test setup is less mature, so we found we needed to go back and re-examine the system setup. Second, the Ampere parts are in the Wiwynn Mt. Jade platform, and this is not the final firmware you will see on systems in Q1 2021. Third, in Q1 2021 we also expect to see AMD EPYC 7003 CPUs. We do have Milan CPUs that we obtained through authorized channels, but we told AMD we would not publish the results given the sensitivity of the SKUs we have. Milan will bring an update to the AMD figures, so expect both the AMD and Ampere/Arm figures to move up next quarter. Finally, Intel Xeon Ice Lake chips will arrive in 2021; we are now expecting them in early Q2. So please take all of these Gen4 results as a point-in-time snapshot that will see a lot of changes over the next quarter and a half.
STH Application Testing
Here is a quick look at real-world application testing versus our PCIe 3.0 x4 and x8 reference drives:
As you can see, there is a lot of variability here in terms of how much impact the Kioxia CD6-L and PCIe Gen4 have. As noted in the chart, some of these tests use x86 VMs, and we have not ported everything to Arm or validated Arm-based setups at this point, so we are only showing two Ampere results. Let us go through and discuss the performance drivers.
On the NVIDIA T4 MobileNet V1 script, we see a very small performance impact. The key here is that we are mostly limited by the performance of the NVIDIA T4, and storage is not the bottleneck. We can see a benefit from the newer drives, but it is not huge. Perhaps the more impactful change is that the move from Gen3 x8 to Gen4 x4 frees up PCIe connectivity for additional NVIDIA T4s in a system, thus having a greater impact on total performance. This is a strange way to discuss system performance for storage, but it is very relevant in the AI space.
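To make the lane math concrete, here is a minimal sketch of theoretical link bandwidth. It assumes the published PCIe per-lane signaling rates (8 GT/s for Gen3, 16 GT/s for Gen4) and 128b/130b encoding. The function name is ours for illustration only, and real drives deliver less than these figures because of protocol overhead.

```python
# Theoretical PCIe link bandwidth; ignores packet and protocol overhead.
GT_PER_S = {3: 8.0, 4: 16.0}  # per-lane signaling rate in GT/s
ENCODING = 128 / 130          # 128b/130b line encoding used by Gen3 and Gen4


def link_gb_per_s(gen: int, lanes: int) -> float:
    """Return theoretical one-direction link bandwidth in GB/s."""
    return GT_PER_S[gen] * ENCODING * lanes / 8  # 8 bits per byte


gen3_x8 = link_gb_per_s(3, 8)
gen4_x4 = link_gb_per_s(4, 4)
print(f"Gen3 x8: {gen3_x8:.2f} GB/s")
print(f"Gen4 x4: {gen4_x4:.2f} GB/s")

# Lanes freed per drive when a Gen3 x8 device is replaced by a Gen4 x4 device.
lanes_freed_per_drive = 8 - 4
print(f"Lanes freed per drive: {lanes_freed_per_drive}")
```

The two links come out the same on paper, which is why a Gen4 x4 drive can stand in for a Gen3 x8 device while handing four lanes back to the system for accelerators.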
Likewise, our Adobe Media Encoder script times the copy to the drive, then the transcoding of the video file, followed by the transfer off of the drive. Here, we have a bigger impact because we have some larger sequential reads/writes involved, but the primary performance driver is the encoding speed. The key takeaway from these tests is that if you are compute limited, but still need to go to storage for some parts of a workflow, there is an appreciable impact but not as big of an impact as getting more compute. Here, the CD6-L performed about where we would expect given the sequential read/write numbers we saw.
On the KVM virtualization testing, we see heavier reliance upon storage. KVM virtualization Workload 1 is more CPU limited than Workload 2 or the VM Boot Storm workload, so we see a strong performance gain, albeit not as much as in the other two. These are KVM virtualization-based workloads where our client is testing how many VMs it can have online at a given time while completing work under the target SLA. Each VM is a self-contained worker. We know, based on our performance profiling, that Workload 2, due to the databases being used, actually scales better with fast storage and Optane PMem. At the same time, if the dataset is larger, PMem does not have the capacity to scale. This profiling is also why we use Workload 1 in our CPU reviews. We see that the Kioxia CM6 is frankly faster than the Kioxia CD6-L. Our sense is that this is purposeful, as the CM6 is a higher cost-per-GB drive. For many, trading a few percentage points of performance is worth getting more capacity. If one can hold more VMs on the storage, then that can have a bigger TCO benefit than having slightly faster application performance. That is not true in all cases, which is why we have the mixed-use CD6-L and the CM6.
Moving to the file server and nginx CDN, we see much better QoS from the new CD6 versus the PCIe Gen3 x4 drives. Perhaps this makes sense if we think of an SSD on PCIe Gen4 as having a lower-latency link as well. On the nginx CDN test, we are using an old snapshot and access patterns from the STH website, with DRAM caching disabled, to show what the performance looks like in that case. Here is a quick look at the distribution:
Overall, we saw a few outliers, but this is excellent performance. Our performance was again not as good as we saw on the Kioxia CM6, but it was much better than our baseline PCIe Gen3 SSDs. Perhaps the key takeaway is that the CM6 is faster, but the CD6 is a better value if you need capacity and are focused on reads.
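For readers who want to build a similar tail-latency view from their own request logs, here is a minimal sketch using the nearest-rank percentile method. The sample latencies and the function name are made up for illustration. They are not STH data, and this is not the tooling used for the chart.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds (not measured data).
latencies_ms = [1.2, 1.3, 1.1, 1.4, 1.2, 1.3, 9.8, 1.2, 1.5, 1.3]

for pct in (50, 90, 99, 99.9):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
```

With a single slow request in the sample, the median barely moves while the high percentiles jump, which is the kind of outlier behavior a QoS distribution is meant to expose.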
We swapped the drives to an AMD EPYC 7742 platform and the Ampere platform. This gives us a 128 core/256 thread AMD platform and a 160 core/thread Ampere platform. Since the application is Linux and nginx, it is fairly mature on Arm.
Here we can see slightly better performance than we saw on the AMD system as we got further into the tail. AMD’s SMT design, along with its chiplet architecture, does have a small impact here. In general, as we test the Q80-33s, they are not always better than the EPYC 7002 series, but I will let Patrick discuss that in the full Altra review.
To us, the key takeaway is that as we migrate to the PCIe Gen4 era, it is going to become very difficult to recommend SATA over NVMe as drives like the Kioxia CD6-L (likely soon joined by others) push PCIe Gen4 NVMe pricing down to meet SATA. With next-gen systems, we are going to have better support for NVMe drives, so the end of SATA in the data center is coming.
Next, we are going to give some market perspective on a new variant before moving to our final words.