Relying Upon Hard Drive Service Contracts or Think Like a Cloud Provider

Seagate Exos X16 16TB Hard Drive Cover

When purchasing servers, customers are usually offered a choice of warranty options. Extended durations and faster response times can add substantially to the overall cost of a purchase. However, buying an expensive warranty from your chosen server vendor is not the only option available, and if you purchase a server from a smaller vendor, you may not be offered the SLA you want. In this article, we will explore an alternative: self-warrantying, which is what many cloud providers and major OEMs do themselves. For many organizations, this approach can yield significant benefits. Most of this article focuses on hard drives, but a similar methodology can be applied to other components or even entire servers.

OEMs and Cloud Providers vs Manufacturer Warranties

When major OEMs like Dell, HPE, or Apple sell you a hard drive, they typically purchase those drives without a warranty from the manufacturer, only taking a guarantee of failure rates. By doing this, they receive discounted pricing and take on the service warranty from the drive vendor. This is the reason why you typically must contact your OEM for warranty support. Even though the drive may say Seagate or WD, it is serviced by the OEM that sold the server. OEMs cover the warranty for the drives and other system components with the markup from the system and by selling you a warranty for the whole server.

HGST Ultrastar DC HC510 10TB

There is another aspect to this which is that large OEMs typically have their own tested and validated firmware. This firmware is often tuned by the vendor to work specifically with their own systems and the vendors build warranty assumptions around systems using drives with specific firmware features. Increasingly this firmware and data is being used by OEMs to perform predictive analytics and determine when a device is about to fail so a replacement can be dispatched before the device fails.

Likewise, when a hard drive fails at a major cloud provider, which is often purchasing drives practically by weight, the provider does not send each drive back. Instead, failures are baked into the purchase agreement, and spares are part of the overall discount and volume purchase discussion.

This contrasts with when you acquire drives from retail, where the individual drives have warranties serviced by their manufacturer. Purchasing whitebox storage or servers from some resellers may also rely upon manufacturer warranties, especially with smaller and lower-cost resellers.

Anecdote: Service Contracts Are Not 100% Guaranteed

This falls into the category of an anecdote and represents an extremely unlikely scenario, but it did happen to me personally so I feel like I can tell this story. Approximately 8 years ago, I was operating a couple of 48-bay SANs from a big-name vendor. When those SANs were purchased they came with a 4-hour onsite service contract lasting 5 years. We thought this was a path to never worrying about hardware.

At some point early in their life, a drive failed in the SAN being used as the primary unit, and we rang the vendor for support. They were very willing to help us out, of course, but there was a problem: they simply did not have any drives available. If you remember back to late 2011, flooding in Thailand had a large impact on hard drive availability, and our vendor was affected. As a result, they could not deliver within their 4-hour replacement window; in fact, it took nearly two weeks before a replacement drive arrived. During those two weeks, the degraded SAN suffered multiple additional drive failures, which consumed all the internally configured hot-spare disks and put the array at serious risk of being lost. By the time replacement disks started arriving, we were seriously contemplating pilfering disks from the secondary SAN unit to stave off disaster on the primary device. That is not a situation we expected to be in with a 4-hour replacement service contract.

Clearly, this story is unlikely to repeat itself, but it is not impossible to think up a scenario where replacement disks from a vendor might be delayed for one reason or another. This incident was the original genesis for the “buy some spares” purchasing philosophy which led to this article being written. After getting perilously close to losing my primary SAN, we purchased our own cold-spare to avoid ever being in that situation again.

Drive Failure Rate Assumptions

First, it is worth determining how likely a drive failure is in the first place. The failure rate of a hard drive varies from model to model, but for the purposes of this article, we are interested in an average. Backblaze, an online backup company operating over 100,000 hard drives, has been collecting and publishing quarterly statistics on its fleet of disks for years. It has enough drives to provide real data we can use without relying upon anecdotal evidence. According to its Q3 2019 report, the overall Annualized Failure Rate (AFR) across its entire fleet, over the entire lifetime of those disks, is 1.73%.

Backblaze Storage Pod V1

The hypothetical system we will be considering has 8x 8TB drives. Assuming each drive independently has a 1.73% chance of failure in any given year, there is around a 13.03% chance (1 - (1 - 0.0173)^8) that our 8-drive system will suffer at least one failure in any given year. Over a 5-year lifetime, there is approximately a 50% chance that at least one disk will fail, and around a 14% chance that two or more of the original eight disks will fail. Of course, these are just probabilities; throw in some good or bad luck and your experience could be very different. For the purposes of this article, let us assume we have a bit of bad luck and will suffer two drive failures in that five-year period.
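As a rough check on these figures, the sketch below treats each drive as failing independently at a constant 1.73% per year. That is a simplifying assumption, not something the Backblaze data guarantees, and failures of replacement drives are ignored. The function names are made up for illustration.

```python
from math import comb

AFR = 0.0173   # Backblaze lifetime annualized failure rate quoted above
DRIVES = 8     # drives in the hypothetical system
YEARS = 5      # planned service life


def p_drive_fails(afr: float, years: int) -> float:
    """Chance that a single drive fails at some point within `years`."""
    return 1.0 - (1.0 - afr) ** years


def p_at_least(k: int, n: int, p: float) -> float:
    """Binomial chance that at least k of n drives fail, each with chance p."""
    p_fewer = sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k))
    return 1.0 - p_fewer


one_year_any = p_at_least(1, DRIVES, p_drive_fails(AFR, 1))
five_year_any = p_at_least(1, DRIVES, p_drive_fails(AFR, YEARS))
five_year_two = p_at_least(2, DRIVES, p_drive_fails(AFR, YEARS))

print(f"At least one failure in 1 year:    {one_year_any:.2%}")
print(f"At least one failure in {YEARS} years:   {five_year_any:.2%}")
print(f"Two or more failures in {YEARS} years:   {five_year_two:.2%}")
```

Under these assumptions the three results come out at roughly 13%, 50%, and 14% respectively. Changing `DRIVES` to 100 reproduces the roughly 82.5% one-year figure mentioned in the comments.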

Author’s note: I have corrected my math on the probability of failures here. Thanks to the comments section for pointing out I’m a bit bad at math!

Drive Cost: System Vendor vs CDW

Let us take a look at Dell EMC PowerEdge T640 hard drive pricing by way of example, since Dell is the largest server vendor in the world at the time of this writing. We are going to use 7.2K rpm 512e SATA hard drives of 8TB capacity for our comparison, as they are a fairly common size.

Dell EMC PowerEdge T640 HDD Pricing 2020 02 23

Of course, for the big OEMs your warranty does not come from the drives themselves but from the overall warranty on the server you purchase them with, which can add its own costs. Using Dell as an example, our hypothetical system with 8x of their 8TB drives listed above was $9107.95 with a 3-year next-business-day service contract, and moving to a 5-year term brought an increase of $460.90.

If we turn to retail drives from CDW, a large US IT distributor/reseller, we see very different pricing:

| Drive Name | Drive Model | Warranty | Cost | Vendor |
|---|---|---|---|---|
| Seagate Exos 7E8 4TB 7.2K 512n | ST4000NM0035 | 5 years | $166.99 | CDW |
| Seagate Exos 7E8 8TB 7.2K 512e | ST8000NM0055 | 5 years | $287.99 | CDW |
| Seagate Exos X14 10TB 7.2K 512e | ST10000NM0478 | 5 years | $379.99 | CDW |
| Seagate Exos X14 12TB 7.2K 512e | ST12000NM0248 | 5 years | $441.99 | CDW |
| Seagate Exos X16 14TB 7.2K 512e | ST14000NM001G | 5 years | $538.99 | CDW |
| Seagate Exos X16 16TB 7.2K 512e | ST16000NM001G | 5 years | $589.99 | CDW |

These drives will not all be an exact model match obviously, but they should be comparable in their capabilities to their OEM brethren. In comparing retail drive costs to Dell (and this usually works for Lenovo and HPE as well), it is less expensive to buy 16TB drives at retail than it is to buy 8TB drives from the big system OEMs. 

Our 8TB 512e 7.2K rpm hard drive is $287.99 from CDW and $636.61 from Dell EMC, a delta of $348.62. Put another way, you can get 2.21 CDW 8TB drives for every 1 Dell EMC drive. This is on a component with a failure rate of only 1.73% per year. At 1.02:1 we would be at a virtual tie, but at 2.21:1 this can be a significant source of cost savings.

Self-Warranty Through Cold Spares

With the standard warranty process for a retail drive, getting a disk replaced via RMA can take weeks. One way to sidestep this is to simply purchase a cold-spare drive or two when the disks are originally purchased, paying for them out of the cost savings achieved by buying retail in the first place. In our example above, where 8x 8TB drives were purchased, Dell would have cost $5092.88, plus the extra $460.90 to bring the warranty up to 5 years. Buying the retail drives was only $2303.92, resulting in a cost savings of approximately $3250. Put another way, you could buy two CDW drives for every Dell drive purchased with this warranty and still save around $950 in total. That gives you eight cold spares and eight running drives, plus the opportunity to RMA drives and keep your spare pile stocked. You would have the drives immediately on hand in case they are needed.
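The arithmetic above can be reproduced with a few lines of Python. This is only a sketch using the prices quoted in this article; it ignores shipping, drive carriers, labor, and remote hands, and the function names are hypothetical.

```python
DELL_DRIVE = 636.61       # Dell EMC 8TB drive price quoted above (USD)
CDW_DRIVE = 287.99        # CDW 8TB drive price quoted above (USD)
WARRANTY_UPLIFT = 460.90  # cost of moving from a 3-year to a 5-year warranty
INSTALLED = 8             # drives running in the system


def oem_cost(installed: int = INSTALLED) -> float:
    """OEM drives plus the 5-year warranty uplift."""
    return round(installed * DELL_DRIVE + WARRANTY_UPLIFT, 2)


def retail_cost(spares: int, installed: int = INSTALLED) -> float:
    """Retail drives for every bay plus the chosen number of cold spares."""
    return round((installed + spares) * CDW_DRIVE, 2)


def savings(spares: int) -> float:
    """Money left over after buying retail drives and spares instead."""
    return round(oem_cost() - retail_cost(spares), 2)


for spares in (0, 2, 8):
    print(
        f"{spares} cold spares: retail ${retail_cost(spares):,.2f}, "
        f"savings ${savings(spares):,.2f}"
    )
```

With no spares the savings are $3,249.86, and even with a full set of eight spares they are still $945.94.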

Synology DS1517+ Seagate Ironwolf HDD
Seagate Ironwolf NAS HDD

A 1:1 ratio of installed drives to cold spares is needlessly high, though. We expect only about a 14% chance that two or more of the eight drives will fail in five years, yet we would be provisioning 8 cold spares; a more reasonable general recommendation would be 2 or 3 spare drives.
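One way to sanity-check a spare count is to estimate the chance of running out of spares over the life of the system. The sketch below is a simplified model, not a definitive method: it assumes every drive bay has an independent 1.73% chance of a failure in each year, that a failed drive is replaced immediately from the spare pile, and that no spares are replenished through RMA. The function names are hypothetical.

```python
from math import comb

AFR = 0.0173   # annualized failure rate quoted earlier in the article
DRIVES = 8     # installed drives
YEARS = 5      # planned service life


def p_more_failures_than(spares: int, afr: float = AFR,
                         drives: int = DRIVES, years: int = YEARS) -> float:
    """Chance that failures over the service life exceed the spare count.

    Simplification: each bay-year is one independent trial that fails
    with probability `afr`, so there are drives * years trials in total.
    """
    trials = drives * years
    p_covered = sum(
        comb(trials, k) * afr**k * (1.0 - afr) ** (trials - k)
        for k in range(spares + 1)
    )
    return 1.0 - p_covered


def spares_needed(max_risk: float) -> int:
    """Smallest spare count whose chance of running out is at most max_risk."""
    spares = 0
    while p_more_failures_than(spares) > max_risk:
        spares += 1
    return spares


for n in range(5):
    print(f"{n} spares: {p_more_failures_than(n):.2%} chance of running out")
```

Under these assumptions, two spares leave roughly a 3% chance of running out over five years and three spares leave about half a percent, which lines up with a recommendation of 2 or 3 spares. Returning failed drives under RMA to restock the pile would lower the risk further.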

Another important aspect here is that hard drive prices tend to fall over time. Today’s 16TB $589.99 drive in five years is likely less than half that. STH covered the math behind this logic years ago in the article “Internal or External Hard Drives: Are Warranties Worth the Cost?” You can see the analysis there, which uses future value discount rates and an even higher AFR (5%), with no-warranty external drives for comparison.

Buying drives up-front as spares helps to protect against events such as the Thailand flooding where buying drives on the open market becomes challenging.

Clear Benefits of Vendor Support

Along with the cost difference, there are obvious differences in how warranties are serviced. Your 3 or 5-year warranty with Dell includes the cost of a technician who will come onsite to replace a faulty disk. Depending on your physical proximity to your server equipment, this convenience can be worth a lot. If your server is in a data center hundreds of miles away, servicing it yourself may not be a viable option or may incur additional charges for remote hands support. For many organizations, this is a key benefit. Big vendor service organizations often have harrowing tales of the extreme places they have replaced hard drives.

Most retail-purchased drives are serviced by sending them back to the manufacturer and receiving a replacement in return, which can require two trips to the data center or two sets of remote hands. This may seem like a small detail, but it can be important when working with many drives. Dell EMC drives and replacement drives also come with an appropriate sled/carrier, while a retail bare drive does not. You may have to swap the new drive into an existing carrier or find another carrier, adding to the cost of drive replacement.

For many organizations, this is all that matters and that is why these agreements are so popular.

Self-Warranty Not Just Limited to Hard Drives

This article has been all about hard drives, but the concept of keeping spare equipment around is not limited to disks. Especially as equipment ages or exceeds warranty terms, keeping some spare parts around can be a good idea. Power supplies are another semi-common point of failure and are a relatively inexpensive investment to keep a spare. For 1-2 server installations, having cold spares on hand can seem like a waste. At larger installation sizes the cost becomes relatively small. If you were to look at a hyper-scale data center, they are not waiting for a server vendor to replace a part. Instead, they have spares on hand that their staff can use for replacement. They are buying high-quality but lower-cost servers to ensure this model works.

When it came time to put in new core switches in my datacenter, the decision was made to buy less expensive switches without a service contract, and use the savings to buy an entire extra switch. The extra unit was given a generic configuration and racked in-between our two core switches, ready to take over for either in the case of failure. The same logic can be used to order another server. Spare server capacity immediately available in a rack is invaluable in a failure scenario.

The Decision

Planning how to configure the servers you buy and whether you buy the extra warranty contract or not is a multi-factor decision and a risk management balancing act. The OEMs offer convenience and relatively quick service at a sometimes-steep cost. Indeed, buying a complete support package is the risk-averse way to make it someone else’s responsibility to deal with failure. For the vast majority of businesses, that is the model they want and frankly, the model vendors such as Dell EMC, HPE, and Lenovo cater to.

For those who are extra cost-conscious, or who want to avoid the kind of risk we faced during the Thailand flooding, when a large vendor simply could not get a drive even under a support contract, self-warrantying can make sense if the structure is in place to replace drives or other components in the data center. There are a lot of variables in this equation, but navigating them can help manage risk while potentially offering greater than 50% cost savings.

15 COMMENTS

  1. > Today’s 16TB $589.99 drive in five years is likely less than half that

    I don’t know. Taking the price development of 4 TB NAS retail HDDs in the last years as a reference (retail prices in the EU), I would expect the price to decrease by about 1/3 within the first two or three years, and then largely stay the same. However, much of such price development will of course also depend on when and what new mass storage devices with higher capacity and/or performance will be introduced in the retail market (in other words, how long 16 TB will remain a standard purchase choice in retail).

  2. Addendum to my former comment:
    I just checked price developments for Enterprise HDDs in the last years here in the EU. In contrast to NAS HDDs, enterprise HDD prices really seem to drop 50% of their price within 5 years or so. Well, i didn’t expect that 😉

  3. An excellent article.

    The catch with purchasing a “same make/model” drive down the road would be possible changes in PCB/Chips and/or Firmware.

    In our experience, the up-front cost of same PCB and firmware is worth it for those cold spares.

  4. I second that, a great article!
    Please more of those insightful and invaluable tips!
    And too for rewarding the turning off of my adblocker with non-obstructive ads. A win-win!

  5. Thanks all. This was Will’s first article for STH. A very interesting topic.

    Steven – that is what we aim for. Getting even progressively fewer ads and I have told our external marketing sales team that I do not want to see popups and such.

  6. You are so right Will! My own epiphany came when I was asked to participate in a focus group for Sun Microsystems (remember them?) talking about extended service contracts.
    The assembled focus group were probed about service contract pricing – what if they were priced at 15% of the server purchase price per year? How about 12%? (note: I can’t remember the exact numbers).
    The answer from the wizened pros in the group – I wasn’t one at the time – was simple: Nope, I’m never going to buy a service contract. I just buy a few extra servers – saves a TON of money, and my response time is better, since I just have to run to the spares closet.

  7. sorry, but I stopped reading after: “The hypothetical system we will be considering will have 8x 8TB drives. Assuming each drive has an individual 1.73% chance of failure in any given year, you are looking at around a 13.8% chance that our 8-drive system will suffer a single failure in any given year”

    This is simply not like that – what final percentage would you write if you had, lets say 100 drives? That the percentage of a failure in any given year is 173%? What would that number mean? 🙂

  8. > This is simply not like that – what final percentage would you write if you had, lets say 100 drives? That the percentage of a failure in any given year is 173%? What would that number mean?

    Good catch, a more precise chance of a single failure for an 8-drive system would be 13.03% (1-(1-1.73%)^8), and 82.54% 100 drives.

  9. Great article, however some vendors obviously don’t let customer get out of the hook that easily and put more obstacles in using 3rd party drives in their system. Investigating purchase of older HPE system, I’ve been warned that if I do not use original HPE drives, then fans will spin on 100% all the time producing unbearable noise. In case of HPE drives, system looks to be very silent. I’ve considered HPE 380p g8 that time. I guess HPE policy is still the same.
    Another note: what applies to drives also seems to apply to RAM as well. I tried to configure, for example, a Dell Precision 5820 with 512GB RAM and was surprised to see 8k,- EUR just for RAM when the RAM on the common market is about 3k,- EUR. Usually when RAM is good, it stays good….

  10. @KarelG
    This despicable price gouging is why I prefer barebone server vendors like Gigabyte over traditional server vendors.

    I wonder how hard it would be to modify commodity DIMMs to report as HPE/Dell/Fujitsu originals, I mean the SPD circuit is usually just a small EEPROM / flash chip.
    If so, it may be possible to take ordinary modules from Micron/Samsung/Hynix etc. and just flash copies of the SPD of an original part. Perhaps one would need to hack around the serial numbers being identical, and then worry about checksums. I mean unless the DIMMs have a digitally signed serial number on them, it should work.
    Even if the SPDs were individually signed, if one has multiple servers, one could buy a full complement of DIMMs for one, then populate the rest of the servers with flashed commodity DIMMs.

  11. Very good article.
    Just one remark on the statistics: if there’s 1.73% failures for 100k disks we should consider the same ratio whatever the number of disks used. However the way to use this average is to put as exponent the number of years of usage. Therefore for 5 years you will have 15.496% (1.73^5) chance of failures and for 6 years 26.809%. As the rate is exponential that’s why the HDD have a maximum warranty of 5 years.
    Keep making great articles

  12. Thanks to the comments, I’ve corrected my math on the probability of drive failures. Hopefully I got it right this time, though I’ll fully admit to being very far removed from the last math class I took!

  13. You mention that vendors may have custom firmware applied to the drives, which indicates familiarity with the (some might say) ‘shenanigans’ that are sometimes pulled… but neglected to note that it’s not uncommon that drives purchased outside of that vendor (and as such, missing their OEM FW) be rejected by the SAN/NAS once inserted as invalid.

    There are multiple levels/lengths that those vendors go to in enforcing the requirement that their drives be purchased from them. In some cases it’s something as simple as having a built in ‘whitelist’ of drive models (their specific revisions, as the same model will be iterated over time), up to as complex as requiring that the serial number match their OEM schema + OEM firmware before the drive will even be detected.

    I’d recommend updating the above article to note that one should confirm whether or not alternatively procured drives will be physically recognized and function in their desired use case prior to making a decision here – whether that’s by research, or by simply purchasing one drive and trying it, however they wish to do so.
