Archives of the TeradataForum
Message Posted: Tue, 07 Jan 2003 @ 21:45:43 GMT
First of all, my apologies to anyone who is an expert in failure modes and probability. I like to dabble, but I'm an amateur.
RAID offers protection against a single disk failure. For RAID 1 to fail, both disks in a pair must fail; for this to happen, the second disk must fail before the first is replaced. If you know the mean time between failures (MTBF) and the mean time to repair (MTR), you can calculate the probability of a failure in any given period. This is small for a single pair of disks, but for a system with a large number of disks it can grow to be quite unacceptable.
MTBF (pair) = MTBF ^ 2 / MTR / 2
Let's assume the MTBF of a disk is 4 years (this is a complete guess; my apologies to Seagate if it's too low) and the mean time to repair is 4 hours (this has to include the time the system takes to recreate the data on the new disk). We get a disk pair failure every 17,532 years, or to put it another way, a 0.0057% chance of a failure in a given year. For a hypothetical eight-node system with 20 disk pairs per node you get an MTBF of 110 years, or a 0.91% chance of a failure in a given year.
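The RAID 1 arithmetic above can be sketched in a few lines of Python. The MTBF and MTR values are the same guesses as in the text, and the hours-per-year constant is an assumption of mine (365.25 days):

```python
# Sketch of the RAID 1 pair-failure arithmetic, assuming a disk MTBF of
# 4 years and a 4-hour repair window (both guesses, as noted in the post).
HOURS_PER_YEAR = 8766            # 365.25 days x 24 hours (my assumption)
mtbf_disk = 4 * HOURS_PER_YEAR   # 35,064 hours
mtr = 4                          # hours to replace the disk and rebuild

# MTBF (pair) = MTBF ^ 2 / MTR / 2
mtbf_pair_years = (mtbf_disk ** 2 / mtr / 2) / HOURS_PER_YEAR
print(round(mtbf_pair_years))              # -> 17532 years per pair

# Eight nodes x 20 pairs = 160 pairs, each an independent chance to fail
pairs = 8 * 20
mtbf_system_years = mtbf_pair_years / pairs
print(round(mtbf_system_years))            # -> 110 years for the system
print(f"{100 / mtbf_system_years:.2f}%")   # -> 0.91% chance in a given year
```

Running this reproduces the 17,532-year pair MTBF and the 110-year / 0.91% system figures quoted above.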
For RAID 5 with four-disk ranks, if any of the other three disks in the rank fails before the first failed disk is rebuilt, the rank fails.
MTBF (rank) = 1 / ( MTR / MTBF x ( 1 - ( 1 - 1 / MTBF ) ^3 ) ) / 4
We get a disk rank failure every 2,922 years, or to put it another way, a 0.034% chance of a rank failing in a given year. For a hypothetical eight-node system with 10 disk ranks per node you get an MTBF of 37 years, or a 2.7% chance of a failure in a given year. It should also be noted that there is a performance hit during a single disk failure, as parity must be read from the surviving disks to reconstruct the data. Reconstruction times are also significantly greater. Also, you get less performance per disk with RAID 5, so you end up needing more disks per node. You may have noticed I don't like RAID 5. With disks getting larger, it is hard to justify degrading performance for more storage.
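The same check works for the RAID 5 rank formula. Again the MTBF, MTR, and hours-per-year values are the assumptions from the RAID 1 example, not measured figures:

```python
# Sketch of the RAID 5 rank-failure arithmetic for a four-disk rank,
# using the same guessed MTBF (4 years) and MTR (4 hours) as before.
HOURS_PER_YEAR = 8766            # 365.25 days x 24 hours (my assumption)
mtbf = 4 * HOURS_PER_YEAR        # disk MTBF in hours
mtr = 4                          # repair window in hours

# MTBF (rank) = 1 / ( MTR / MTBF x ( 1 - ( 1 - 1 / MTBF ) ^ 3 ) ) / 4
mtbf_rank_years = (1 / (mtr / mtbf * (1 - (1 - 1 / mtbf) ** 3)) / 4) / HOURS_PER_YEAR
print(round(mtbf_rank_years))             # -> 2922 years per rank

ranks = 8 * 10                   # eight nodes x 10 ranks per node
mtbf_system_years = mtbf_rank_years / ranks
print(round(mtbf_system_years))           # -> 37 years for the system
print(f"{100 / mtbf_system_years:.1f}%")  # -> 2.7% chance in a given year
```

This reproduces the 2,922-year rank MTBF and the 37-year / 2.7% system figures, which is roughly a three-fold worse yearly failure chance than the equivalent RAID 1 configuration.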
These calculations do not take into account other failures (such as the controllers and power supplies) or human error (such as someone turning off a disk cabinet; cleaners have to plug in vacuum cleaners somewhere).
The bottom line is that as systems become larger the number of failures will increase; to get to 99.8% availability, fallback is required. This may not be needed for all applications, but it can be used selectively for the tables used by applications requiring this level of availability.
|Copyright 2016 - All Rights Reserved|
|Last Modified: 28 Jun 2020|