How Was This Missed?
In all of my interactions with WD personnel, they have been top-notch. So how does something like this get missed? WD advertises that it does extensive testing on the WD Red line in NAS scenarios. It knows it has NAS vendors using ZFS. It also knows that ZFS does not play well with DM-SMR. NAS vendors had previously listed these WD Red DM-SMR drives on their compatibility matrices and hardware compatibility lists. Indeed, some still do. So how can something that we can test and find, and that readers are seeing in real-world usage, slip through the cracks?
A few quick data points:
- iXsystems (the company behind FreeNAS/TrueNAS) disqualified the drives in April 2020 after shipping them in September 2019, and after moderators in its forums warned users against them in the summer of 2019.
- Synology, a non-ZFS NAS vendor, has now changed course, listing the 2TB and 6TB drives as incompatible.
Again, we have focused a lot on ZFS since that is what we use in our NASes. However, this is Synology, which does not use ZFS in these NASes, listing the WD Red DM-SMR WD20EFAX (2TB) and WD60EFAX (6TB) drives as incompatible with its product lines.
That Synology “Incompatible Models” listing has over 60 different NAS models that are incompatible. Synology has both traditional RAID at different levels along with Synology Hybrid RAID. Just for these two drives, testing even eight different RAID levels across 60+ platforms means roughly 1,000 configuration permutations (2 drives × 60 platforms × 8 RAID levels = 960, and more with every additional platform). Those are just two drives, from one hard drive vendor, and a single drive line. As one can imagine, it is nearly impossible for NAS vendors to actually test every drive, NAS, and RAID-level permutation.
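As a rough check on that arithmetic, the matrix can be enumerated directly. This is a minimal sketch: the NAS model and RAID level names are placeholders (the listing only tells us there are more than 60 models, and eight RAID levels is the figure used above), so only the counts matter.

```python
from itertools import product

# The two drive models come from the Synology listing discussed above.
drives = ["WD20EFAX", "WD60EFAX"]

# Placeholder names: we only assume 60 NAS models and 8 RAID levels.
nas_models = [f"nas-model-{i}" for i in range(1, 61)]
raid_levels = [f"raid-level-{i}" for i in range(1, 9)]

# Every (drive, NAS model, RAID level) combination is one configuration to test.
test_matrix = list(product(drives, nas_models, raid_levels))

print(len(test_matrix))  # 960
```

With exactly 60 models that is 960 configurations. Since the listing has more than 60 models, the real figure is somewhat higher, and it passes 1,000 once the count reaches 63 models.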
As a result, these companies need to do something that is very common in the technology industry: create a test matrix. That matrix is built upon assumptions about which permutations must be tested to consider sets of drives as qualified.
Assuming that WD actually tested these drives and was OK with the single drive and traditional RAID performance drops, they may not have tested ZFS. While they may have had employees who knew DM-SMR would not be a good fit for ZFS NASes, that knowledge still had to make its way into the test matrix. Also, if a simple workload was used that fit into the CMR area of the DM-SMR drives instead of a thorough real-world test, then they may not have seen the issues. WD has a smaller set of drives to test, but a large number of NAS usage scenarios to cover.
On the NAS vendor side, one of those test matrix assumptions seems to have been that every drive in a line uses the same recording technology. Since that had been the experience for many NAS vendors, it was a reasonable assumption. When a new drive line arrives, such as a new WD Red series, perhaps you test the new larger capacity models instead of the smaller traditional 2TB-6TB models that you have used for generations.
A NAS vendor would not necessarily test these smaller drives unless WD had told them about the DM-SMR change for the lower capacity drives in the new WD Red series. If WD did not tell them, then the NAS vendor test matrix would cover the higher capacity CMR drives and skip the lower capacity points.
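One way to picture that coverage gap is a small selection function. This is a hypothetical sketch, not any vendor's actual qualification process: the `Drive` type, the `select_for_testing` and `untested_smr` helpers, and the 8TB and 12TB entries are made up for illustration. Only the two DM-SMR model numbers come from the listings discussed earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    name: str
    capacity_tb: int
    recording: str  # "CMR" or "DM-SMR"


# Illustrative drive line. The 8TB and 12TB entries are placeholders.
red_line = [
    Drive("WD20EFAX", 2, "DM-SMR"),
    Drive("WD60EFAX", 6, "DM-SMR"),
    Drive("hypothetical-8tb", 8, "CMR"),
    Drive("hypothetical-12tb", 12, "CMR"),
]


def select_for_testing(line, qualified_capacities_tb, disclosed_changes):
    """Test new capacity points, plus any drive the vendor was told had changed."""
    return [
        d for d in line
        if d.capacity_tb not in qualified_capacities_tb or d.name in disclosed_changes
    ]


def untested_smr(line, selected):
    """DM-SMR drives that the test matrix never touches."""
    return [d.name for d in line if d.recording == "DM-SMR" and d not in selected]


# Assumption: the 2TB and 6TB points were qualified in earlier generations.
already_qualified = {2, 6}

no_disclosure = select_for_testing(red_line, already_qualified, set())
with_disclosure = select_for_testing(
    red_line, already_qualified, {"WD20EFAX", "WD60EFAX"}
)

print(untested_smr(red_line, no_disclosure))    # ['WD20EFAX', 'WD60EFAX']
print(untested_smr(red_line, with_disclosure))  # []
```

The selection logic is identical in both runs. The only thing that changes the outcome is whether the recording technology change was disclosed.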
All of this sounds reasonable, and like what may have happened here. There is considerable risk of this information not being passed along properly if the engineering teams are not the ones working with customers. One can imagine how the WD Sales Reps for given NAS vendors never received the memo from Product Management that Engineering said there was a tweak to certain drives. As a result, that information never reached the NAS vendors so that they could update their test matrices.
Even after the drives are qualified, because of how DM-SMR drives are designed to have CMR areas for effectively caching random writes, system-level testing may miss this. A system may pass basic functional tests at the NAS vendor by only hitting cache areas with burn-in workloads. The unit is then shipped to a customer. The customer may take quarters or years to store appreciable amounts of data on the drives or see a failure causing a rebuild scenario. Unless very rigorous testing is done at the time of manufacture and installation, DM-SMR technology can effectively hide the drive’s important performance characteristics for a long time through the chain.
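A toy model shows why a short burn-in can look healthy. This is a deliberately simplified sketch with invented numbers: the cache size and both throughput figures are assumptions for illustration, not measurements of any WD drive, and the model ignores the idle-time cache flushing that real DM-SMR firmware performs.

```python
def average_write_mbps(total_gb, cache_gb, fast_mbps, slow_mbps):
    """Average throughput for one sustained write burst with no idle time.

    Writes land in the CMR cache region at fast_mbps until it is full,
    then proceed at slow_mbps for the remainder.
    """
    fast_gb = min(total_gb, cache_gb)
    slow_gb = total_gb - fast_gb
    seconds = fast_gb * 1000 / fast_mbps + slow_gb * 1000 / slow_mbps
    return total_gb * 1000 / seconds


# All three drive parameters below are hypothetical.
CACHE_GB, FAST_MBPS, SLOW_MBPS = 20, 150, 15

burn_in = average_write_mbps(10, CACHE_GB, FAST_MBPS, SLOW_MBPS)
rebuild = average_write_mbps(2000, CACHE_GB, FAST_MBPS, SLOW_MBPS)

print(f"burn-in (10 GB): {burn_in:.0f} MB/s")
print(f"rebuild (2 TB):  {rebuild:.0f} MB/s")
```

In this model the burn-in never leaves the cache region and reports the fast rate, while the rebuild averages close to the slow rate. A test that only writes less than the cache size cannot tell the two kinds of drive apart.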
What is Going on at WD?
This brings us to the question of why it is now June 14, 2020, and we have not heard anything from the company since April 22, 2020, even with the italicized “Continue checking here for updates regarding our WD Red NAS Drives” emblazoned across the page.
How do we reconcile what we have learned and the interest this story has generated, with no new updates for almost two months?
There are two main ways. Generally, I like to assume the best of people. WD people are very sharp individuals. While they could just be hoping to weather a negative news cycle and keep shipping WD Red DM-SMR drives to customers, hoping it does not impact sales in the long run, that feels like too much of a stretch given what we have seen.
Perhaps the most interesting aspect is that as of publishing this article, nobody at WD has contacted me. Huge numbers of folks in the market, including NAS vendors and end customers, have seen our pieces. From time to time, we publish these investigative pieces since we are an independent site. Usually, we are met with large meetings in-person or on the phone. At a minimum, most large companies will have a point person reach out within a few hours after a piece goes up, even if that is simply to stall and set up a meeting for a few days in the future. After 11 years of growing STH, I can say there is a fairly standard process.
Part of my process here has been to wait. Prior to running STH, I did management consulting at one of the world’s largest professional services firms. During years of consulting, I was able to see many companies large and small, including several in the storage space. One gets a good sense of how corporate structures dictate crisis response from management consulting. One reason companies do not act is that they are often waiting for a decision.
Looking at symptoms we covered here:
- WD has individuals who are intelligent and passionate about their work
- WD knew, or should have known, that ZFS NASes would not function properly with the WD Red DM-SMR drives
- That information did not flag the issue in or for WD’s internal testing
- The DM-SMR change seems not to have made its way to NAS vendors that would be impacted by the change
- Multiple NAS vendors did not hear from their WD contacts about the DM-SMR change, which is especially important because it applied to only a portion of the new line of Red hard drives. Accordingly, these NAS vendors did not update their test matrices
- WD has been unable to coordinate even a “Hi, we saw your piece, can we chat?” reach-out (usually a symptom of not having a message to follow up with)
Generally, these are symptoms of having a corporate structure and culture that is functionally aligned, rather than product aligned. Manufacturing has a pipeline of product in-progress and shipping. Product, marketing, and sales teams have unit and revenue targets. Finance, Customer Service, Legal, PR, and other functions will need to sign off on messaging and plans of action. Instead of there being an individual with general manager or GM-like responsibilities, a problem like this WD Red DM-SMR issue requires many people to develop a response. Having more people involved requires more touchpoints and approvals, which slows response times unless a cross-functional executive drives teams to a course of action.
My sense is that this is a brand issue for WD, but the Red 2TB-6TB line is not the biggest for the company. It is not big enough to drive swift action across functionally aligned groups. The alternative is not one worth entertaining since that paints a poor picture of human empathy.
We are getting to the precipice here. Either WD is going to go down the path of staying silent, or they are going to stop selling these drives and get a rectification plan in place. Beyond consumers, the NAS vendors and resellers who were willing to recommend WD drives to their customers, based on historical greatness (and promises of MDF), are also hurt. Staying on the current course of silent inaction will not help either the reputation of the WD Red brand or the brands of those NAS resellers who sold these drives to their customers.
For our readers who have systems with these WD Red SMR drives, see Rob’s story and his blog post. It may be worth contacting WD directly since they seem to have a support process. Also, develop contingency plans for how to deal with these drives in your environment.
Even if you are not impacted, if you know of clients, friends, family, or others using NAS drives, share your opinion about the SMR drives. Use our data, use Jim Salter’s data, or whatever you would like, but have those conversations. The biggest danger to the community today is not those who know about DM-SMR. Instead, as a tech community, we should take it upon ourselves to help those who are casual NAS users, but who do not understand the impacts of using SMR technology where it does not perform well.