Why Excess Capacity for WHS/ Vail/ Aurora and Hot Spares for RAID are Necessary – Mean Time to Recover (MTTR) Guide

Posted September 1, 2010 by Patrick Kennedy in Storage Reliability

When a hard drive fails in a RAID array or Windows Home Server (WHS)/ Vail/ Aurora system, the data stored on the affected array of disks becomes more vulnerable to data loss. WHS/ Vail/ Aurora and RAID systems, when using array types other than RAID 0, both use redundancy to protect against data loss. In essence, a user trades raw capacity of the installed disks for the ability to maintain access to their data after a drive fails. For an explanation of this redundancy and capacity trade-off, see The RAID Reliability Anthology Part 1.

Once a drive fails, MTTR is a major factor in the safety of an array. As MTTR rises, the vulnerability of an array to failure increases. Upon drive failure an administrator has a few options, both with WHS/ Vail/ Aurora and RAID implementations. With WHS/ Vail/ Aurora, if the remaining unused capacity is greater than the amount of data on the failed drive (assuming the number of disks is three or greater), a user can remove the drive from the WHS/ Vail/ Aurora environment and allow WHS/ Vail/ Aurora to duplicate the remaining data on the available space. In a RAID system, a hot spare provides similar functionality, allowing the array to begin the rebuild process immediately after a disk failure. If WHS/ Vail/ Aurora does not have the spare capacity, or a RAID system does not have a hot spare, MTTR goes up significantly. This is because MTTR is equal to the amount of time it takes to identify a failed drive, replace the failed drive, and then restore redundancy to the storage system. The equation looks something like the below:

MTTR = Time to Identify Failed Drive + Time to Replace Failed Drive + (Capacity Restored / Rebuild Speed)

Note: The above is calculated in seconds. To use it in many equations, one needs to convert this to hours, days, or years.
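The formula and unit conversions above can be expressed as a small helper; this is an illustrative sketch, and the function names are mine rather than anything from the article:

```python
# Sketch of the MTTR formula above; names are illustrative.

def mttr_seconds(identify_s, replace_s, capacity_mb, rebuild_mb_per_s):
    """MTTR = Time to Identify + Time to Replace + (Capacity Restored / Rebuild Speed)."""
    return identify_s + replace_s + capacity_mb / rebuild_mb_per_s

def to_hours(seconds):
    return seconds / 3600           # 60 s * 60 min

def to_days(seconds):
    return seconds / 86400          # 24 h * 60 min * 60 s

def to_years(seconds):
    return seconds / (86400 * 365)  # 365-day year, per the article's assumption
```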

With the above equation a few things can be noted. First, in the case of a RAID system with a hot spare Time to Identify Failed Drive + Time to Replace Failed Drive is essentially zero or at most a few seconds.

In a WHS/ Vail/ Aurora situation where there is enough excess capacity, there can be a very small number for Time to Identify Failed Drive + Time to Replace Failed Drive, numbering in a few minutes. This is because a user must log into WHS/ Vail/ Aurora, remove the drive from the storage pool, allow WHS/ Vail/ Aurora to verify the drive can be removed, and then allow the data replication to occur among the remaining disks.

In either scenario, if the system does not have the ability to restore redundancy with already installed resources, this number rises significantly. To show this, we can take four scenarios and plug them into the MTTDL model (this includes UBER calculations not covered in Part 1 of the RAID Anthology):

Scenario 1: RAID 10 with available Hot Spare

Scenario 2: WHS/ Vail/ Aurora with available excess capacity

Scenario 3: Supported system/ Advanced Replacement with 24 hour response guarantee

Scenario 4: RMA drive to vendor (7 day turn time) and use replacement drive for restoring redundancy

This will not cover the scenario where one goes to a physical store during business hours and is able to replace a disk immediately. However, even that process will likely take an hour at minimum, and much longer if the failure occurs in the evening after business hours.

An assumption is that Rebuild Speed is 50MB/s and Capacity Restored is 1840GB * 1024 = 1,884,160MB. That yields (Capacity Restored / Rebuild Speed) = 37,683s for the entire data set (across the 20 drives), which is about 10.47 hours or 0.44 days. This assumption will be used in the model for every scenario below. Also, I am using 24 hours * 60 minutes * 60 seconds for seconds in a day and 365 days in a year. Those concerned with exact timing and leap years can feel free to adapt this for their own needs.
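A quick check of that rebuild-time assumption, with the conversions spelled out:

```python
# Rebuild-time term from the article's assumptions.
capacity_mb = 1840 * 1024           # 1,884,160 MB restored across the array
rebuild_speed_mb_s = 50             # assumed rebuild rate in MB/s

rebuild_s = capacity_mb / rebuild_speed_mb_s
print(round(rebuild_s))             # 37683 seconds
print(round(rebuild_s / 3600, 2))   # 10.47 hours
print(round(rebuild_s / 86400, 2))  # 0.44 days
```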

Scenario 1: RAID 10 with available Hot Spare

Here the equation is very simple:

MTTR = Time to Identify Failed Drive + Time to Replace Failed Drive + (Capacity Restored / Rebuild Speed)

MTTR = 0s + 0s + 37,683s = 37,683s, which is about 10.47 hours or 0.44 days. Here is a 20 disk RAID 10 array which one can use as a baseline:

MTTR RAID 10 Hot Spare Scenario 1

Scenario 2: WHS/ Vail/ Aurora with available excess capacity

Here we will assume ten minutes to get an automated notification, get to a terminal, and log into the system for Time to Identify Failed Drive, and two minutes between log-in and the start of the rebuild process for Time to Replace Failed Drive. Our equation then becomes:

MTTR = Time to Identify Failed Drive + Time to Replace Failed Drive + (Capacity Restored / Rebuild Speed)

MTTR = 600s + 120s + 37,683s = 38,403s

Note that there is not much difference between a 20 drive system’s MTTR under WHS/ Vail/ Aurora and a RAID system with a hot spare:

MTTR RAID 10 Excess Capacity Scenario 2 WHS Vail Aurora

Scenario 3: Supported system/ Advanced Replacement with 24 hour response guarantee

Here we will assume ten minutes to get an automated notification, get to a terminal, and log into the system for the Time to Identify Failed Drive variable. This login will be the time to notify the support vendor that a service call is required. Next, let us assume 24 hours for the technician or courier to arrive and five minutes for the installation procedure, which yields 86,400s + 300s = 86,700s for the Time to Replace Failed Drive variable. Our equation then becomes:

MTTR = 600s + 86,700s + 37,683s = 124,983s

Here is the impact of that system:

MTTR RAID 10 24 hour drive replacement Scenario 3

Scenario 4: RMA drive to vendor (7 day turn time) and use replacement drive for restoring redundancy

Finally, building upon Scenario 3, let us assume all of the same variables except a seven day turnaround time instead of a 24 hour turnaround time. The Time to Replace Failed Drive goes from 86,700s in Scenario 3 to 7 * 86,400s + 300s = 605,100s in Scenario 4, making our equation:

MTTR = Time to Identify Failed Drive + Time to Replace Failed Drive + (Capacity Restored / Rebuild Speed)

MTTR = 600s + 605,100s + 37,683s = 643,383s

Here is the failure model given that setup:

MTTR RAID 10 1 Week Replacement Scenario 4

WHS/ Vail/ Aurora and RAID 10 MTTR with Differing Rebuild Start Time Analysis

As one can see, the differences are not enormous here. Ten years out, the best versus worst case in the simple model went from 0.951% in Year 10 (Scenario 1) to 1.200% in Year 10 (Scenario 4) when comparing a hot spare immediately available against waiting a week to replace the drive. For some people that is very significant, but for most users it may suggest that there is no use for a hot spare or excess capacity. Of course, I had to investigate a bit further.
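To make the comparison concrete, the four scenarios can be tallied with a short script. This is an illustrative sketch using the assumptions above (50MB/s rebuild, 1,840GB restored) and 86,400 seconds per day (24 * 60 * 60):

```python
# Tally MTTR for all four scenarios; scenario labels are mine.
REBUILD_S = 1840 * 1024 / 50  # capacity restored / rebuild speed, ~37,683 s
DAY_S = 24 * 60 * 60          # 86,400 seconds in a day

# Time to Identify + Time to Replace, per scenario, in seconds.
scenarios = {
    "1: hot spare":            0 + 0,
    "2: WHS excess capacity":  600 + 120,
    "3: 24h advance replace":  600 + DAY_S + 300,
    "4: 7-day RMA":            600 + 7 * DAY_S + 300,
}

for name, pre_rebuild_s in scenarios.items():
    mttr = pre_rebuild_s + REBUILD_S
    print(f"Scenario {name}: {mttr:,.0f} s ({mttr / DAY_S:.2f} days)")
```

The rebuild term dominates Scenarios 1 and 2, while the replacement wait dominates Scenarios 3 and 4.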

MTTR RAID 10 All 4 Scenarios

RAID 4/ RAID 5 and RAID 6 MTTDL Modelled with Varying MTTR Scenarios

I used the above four models and their varying starting times to see what the effect would be on the MTTDL in both RAID 4/ RAID 5 and RAID 6. Needless to say, I did see larger variances than with the RAID 10 and WHS/ Vail/ Aurora setups used above. I did keep the amount of data stored here at a constant 15TB.
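The full model, including the UBER terms from Part 1, is not reproduced here. As a rough illustration of why parity RAID is more sensitive to MTTR, the classic textbook MTTDL approximations can be sketched; note these are the standard approximations, not the article's exact model, and the drive count and MTTF below are hypothetical:

```python
# Classic MTTDL approximations (no UBER term, unlike the article's model).

def mttdl_raid5(n, mttf_h, mttr_h):
    # Data is lost if a second drive fails during the rebuild window,
    # so MTTDL shrinks linearly as MTTR grows.
    return mttf_h ** 2 / (n * (n - 1) * mttr_h)

def mttdl_raid6(n, mttf_h, mttr_h):
    # Data is lost only on a third failure during a double rebuild,
    # so MTTDL shrinks with the square of MTTR.
    return mttf_h ** 3 / (n * (n - 1) * (n - 2) * mttr_h ** 2)

# Hypothetical figures: 10 drives, 1,000,000 hour MTTF, with MTTRs
# roughly matching Scenario 1 (~10.5 h) and Scenario 4 (~179 h).
for mttr_h in (10.5, 179.0):
    print(f"MTTR {mttr_h}h: RAID5 {mttdl_raid5(10, 1e6, mttr_h):.3g}h, "
          f"RAID6 {mttdl_raid6(10, 1e6, mttr_h):.3g}h")
```

Because MTTR enters the RAID 6 denominator squared, a longer replacement wait costs RAID 6 proportionally more reliability, which is consistent with the larger spreads seen below.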

The RAID 4/ RAID 5 modelling looked like the below:

MTTR RAID 4 and RAID 5 (same graph) All 4 Scenarios

One can clearly see a 2.5% Year 1 MTTDL difference in RAID 4/ RAID 5 between hot spare availability (Scenario 1) and a week-long wait for replacement (Scenario 4), a differential which grows to 5.75% by Year 10. RAID 6 also shows some interesting traits with lower reliability.

MTTR RAID 6 All 4 Scenarios

Perhaps more dramatically, RAID 6 shows a 0.518% Year 1 MTTDL in Scenario 1 versus 0.745% in Scenario 4. By Year 10, the Scenario 1 and Scenario 4 figures were 5.060% and 7.202% respectively. That is a very large difference as far as these things go.

Conclusion

Overall, one can clearly see that RAID 10 and WHS/ Vail/ Aurora style systems are more tolerant of longer replacement times. Conversely, RAID 4/ RAID 5 and RAID 6 all show fairly significant changes between having an immediate hot spare available and having to wait a week for an RMA to be processed. I should note that I have seen hard drive RMAs take upwards of two to three weeks this year, so it is very possible that the MTTR numbers could include a delay significantly longer than one week.


About the Author

Patrick Kennedy

Patrick has been running ServeTheHome since 2009 and covers a wide variety of home and small business IT topics. For his day job, Patrick is a management consultant focused on the technology industry and has worked with numerous large hardware and storage vendors in Silicon Valley. The goal of STH is simply to help users find some information about basic server building blocks. If you have any helpful information please feel free to post on the forums.

3 Comments


  1.  
    Evan Tirus

    Cool article. Keep up the graphs and model.




  2.  
    Chimel

    “1840GB * 1024 = 1,884,160MB. That yields a (Capacity Restored / Rebuild Speed) = 37,683s, 104.68 hours or 4.36 days.”

    Wow there, it’s 10 hours, not 104.




    •  

      Very true. The 37,683s was for the entire dataset across all drives not one drive. Right number, poor description on my part. I edited that line to hopefully make it a bit more clear. Thanks for the catch.





