
RAID Rebuilds Will Kill The Hard Disk

George Crump, President, Storage Switzerland

November 1, 2010


We've written about it before, as have others. RAID rebuild times continue to increase, and as they do, the very technology that made the hard drive safe for the enterprise thirty years ago may now be its undoing. The time it takes to rebuild a drive, measured in double-digit hours if not days, has a critical impact on performance and data reliability. The workarounds may lead you to solid state disk faster than you originally planned.

The big risk with long RAID rebuild times is that, for as long as the rebuild takes, you are exposed to complete data loss if a second drive fails. To make matters worse, the chances of a drive failure are higher during a rebuild, and of course the consequences of failure are more severe. While RAID 6 does give you an extra safety net, as drive sizes continue to increase, the chance of a second and third failure during the rebuild process will climb exponentially. In addition to the data vulnerability, there is also the performance loss as the storage controllers and the disk drives busily work on re-creating the failed drive.
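To put rough numbers on that exposure window, here is a minimal back-of-the-envelope sketch in Python. Everything in it is an assumption made for illustration: the function names, the 2 TB capacity, the 50 MB/s sustained rebuild rate, the 3% annualized failure rate, and above all the model of independent failures at a constant rate. None of these figures come from a specific array or vendor.

```python
import math


def rebuild_hours(capacity_tb: float, rebuild_mb_per_s: float) -> float:
    """Hours needed to rewrite a whole drive at a sustained rebuild rate."""
    capacity_mb = capacity_tb * 1_000_000  # decimal terabytes to megabytes
    return capacity_mb / rebuild_mb_per_s / 3600


def second_failure_probability(surviving_drives: int, afr: float, hours: float) -> float:
    """Chance that at least one surviving drive fails during the rebuild window.

    Assumes independent failures at a constant rate derived from the
    annualized failure rate (afr). This is an illustrative simplification.
    """
    rate_per_hour = -math.log(1.0 - afr) / 8760  # 8760 hours in a year
    return 1.0 - math.exp(-rate_per_hour * surviving_drives * hours)


# Hypothetical example: one failed 2 TB drive in an 8-drive group.
window = rebuild_hours(capacity_tb=2, rebuild_mb_per_s=50)
risk = second_failure_probability(surviving_drives=7, afr=0.03, hours=window)
print(f"rebuild window: {window:.1f} hours, second-failure risk: {risk:.4%}")
```

Under these assumptions the window is about 11 hours, and it scales linearly with capacity if the rebuild rate stays flat. The independence assumption is optimistic: as noted above, drives are more likely to fail while a rebuild is stressing them, so treat the computed risk as a floor rather than an estimate.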

There are some workarounds even with mechanical drives. First, you can over-allocate processing power to make sure that there is always spare processing and storage I/O capacity to handle the task. In my opinion this is going to have to be a requirement for storage vendors going forward: assume a constant state of rebuild, given the size and quantity of drives that we project the typical enterprise array will have within the next few years. In fact, it would make sense, and be quite a bold statement, for storage suppliers to start providing performance benchmark statistics taken while a rebuild is in process, since that is going to be the reality in the next few years. I bet it won't be long before we begin to see claims of no performance loss during a rebuild.

Another workaround is smarter rebuilds. We've seen several approaches to this. First, don't kill the whole drive just because a small section of it has gone bad. Mark that section out and keep using the drive. Most drives won't fail completely; they just reach a threshold of unsuitability. Another approach is to understand which blocks on the drive actually hold data and move only those blocks to a new drive. Part of this has been accomplished by the smarter file systems commonly seen in NAS systems, where the data protection is set at a file or folder level and data can be re-created according to the rule that applies to the file, not the whole disk.
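The "move only the blocks that hold data" idea can be sketched as follows. This is a toy model, not how any particular array or file system does it: it assumes single-parity (XOR) protection, tiny in-memory blocks, and a hypothetical `allocated` set that tells the rebuild which stripes are in use. The function names are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together (single-parity reconstruction)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_allocated(survivors: list[list[bytes]], allocated: set[int]) -> dict[int, bytes]:
    """Reconstruct the failed drive's blocks only for stripes that hold data.

    survivors[d][s] is the block (data or parity) that surviving drive d
    holds for stripe s. Stripes not listed in `allocated` are skipped.
    """
    return {
        stripe: xor_blocks([drive[stripe] for drive in survivors])
        for stripe in sorted(allocated)
    }


# Hypothetical 3+1 layout with three stripes; stripe 1 is unused.
d0 = [b"\x01\x02", b"\x00\x00", b"\x10\x20"]
d1 = [b"\x0f\x0e", b"\x00\x00", b"\xaa\xbb"]  # this drive "fails"
d2 = [b"\x33\x44", b"\x00\x00", b"\x05\x06"]
parity = [xor_blocks([a, b, c]) for a, b, c in zip(d0, d1, d2)]

rebuilt = rebuild_allocated([d0, d2, parity], allocated={0, 2})
print(rebuilt == {0: d1[0], 2: d1[2]})
```

In this sketch only two of the three stripes are read and rewritten. On a real array that is, say, 40% full, the same idea would cut the rebuild work roughly in proportion, provided the system tracks allocation at all.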

Part of this smarter rebuild could be to schedule the rebuild in advance. Using the above concepts, when a drive needs to be retired, meaning it is still working but its bad block count is increasing too quickly, reallocate its data across the other drives in the system during idle times. With this approach the storage software monitors storage activity and starts repositioning data during lulls. When the data has been fully moved to more reliable drives, the failing drive can be turned off and flagged for replacement. When it is replaced with a new drive, data can once again be distributed onto it as space allows. Taking this concept a step further, the storage system could leverage free space on drives to provide extra data protection when the capacity is available and then back the protection level down as capacity becomes constrained. Considering that most systems almost always have tons of excess capacity, why not use that capacity for extra protection and faster rebuilds?
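A scheduled, idle-time rebuild of this kind might look like the sketch below. It is only an illustration of the policy described above: the `DriveHealth` record, the threshold of 5 new bad blocks per sample, the 30% utilization cutoff, and the batch size are all hypothetical values chosen for the example, not settings from any real product.

```python
from dataclasses import dataclass


@dataclass
class DriveHealth:
    name: str
    bad_block_counts: list[int]  # cumulative count per sample, oldest first


def should_retire(drive: DriveHealth, max_new_bad_blocks: int = 5) -> bool:
    """Flag a working drive whose bad block count is growing too quickly."""
    counts = drive.bad_block_counts
    if len(counts) < 2:
        return False  # not enough history to judge a trend
    return counts[-1] - counts[-2] > max_new_bad_blocks


def blocks_to_move(pending_blocks: int, utilization: float,
                   idle_below: float = 0.3, batch_size: int = 1000) -> int:
    """Move a batch only when the array is idle enough; otherwise wait."""
    if utilization >= idle_below:
        return 0
    return min(batch_size, pending_blocks)


def drain(pending_blocks: int, utilization_samples: list[float]) -> int:
    """Apply the idle-time policy over a series of samples; return what is left."""
    for utilization in utilization_samples:
        pending_blocks -= blocks_to_move(pending_blocks, utilization)
    return pending_blocks


suspect = DriveHealth("drive-3", bad_block_counts=[2, 3, 12])
if should_retire(suspect):
    remaining = drain(pending_blocks=2500, utilization_samples=[0.1, 0.9, 0.2, 0.1])
    print(f"{suspect.name}: {remaining} blocks left to relocate")
```

The design choice worth noting is that the drive keeps serving I/O while it drains, so the array never enters a degraded state; the cost is that relocation pauses whenever the array is busy, which stretches the time a suspect drive stays in service.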

All of these workarounds can make the emerging dedicated flash-based appliances even more attractive. These systems are designed from the ground up to leverage the unique characteristics of flash modules, which are smaller in capacity on a per-module basis but denser in their packaging, less sensitive to tight enclosures and of course significantly faster. As a result, rebuild times on these systems are still measured in minutes instead of hours. They accomplish this in most cases with minimal impact on performance, less than 20%, and don't need all the fancy workarounds that mechanical systems do.

However you get there, one topic of conversation with your storage vendors has to be how the system behaves while a RAID set is being rebuilt, and you have to understand how long you are going to be in an exposed state. The reality is that you will be facing that situation on a near-constant basis in the near future.

Track us on Twitter: http://twitter.com/storageswiss


George Crump is lead analyst of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments. Find Storage Switzerland's disclosure statement here.

About the Author(s)

George Crump

President, Storage Switzerland

George Crump is president and founder of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments. With 25 years of experience designing storage solutions for datacenters across the US, he has seen the birth of such technologies as RAID, NAS, and SAN. Prior to founding Storage Switzerland, he was CTO at one of the nation’s largest storage integrators, where he was in charge of technology testing, integration, and product selection. George is responsible for the storage blog on InformationWeek's website and is a regular contributor to publications such as Byte and Switch, SearchStorage, eWeek, SearchServerVirtualization, and SearchDataBackup.
