In our last entry we discussed lessons to be learned from the Gmail crash. In an upcoming entry we'll cover establishing the tiered recovery points. The three tiers of recovery, high availability (HA), backup and archive, share a goal: application availability. What separates them is the time it takes to put the data back in place so the application can return to service. Accounting for recovery transfer time is an important part of the overall recovery process.
Recovery transfer time is the total time it takes to move data from the recovery device back to the production device. Many IT administrators overlook it, or calculate it only when an application first goes into production. The problem is that the longer an application is in production, the larger its data set typically grows. As the data set grows, so does the time it takes to reposition that data on the primary storage device after a failure. Recovery expectations need to be adjusted, or further investment made in recovery performance, on an ongoing basis. I would love to see the backup reporting products develop an estimated recovery time report that does this for you automatically. Without that capability the recovery time has to be calculated manually, typically through a real test recovery.
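Until such a report exists, a rough estimate is easy to script. The sketch below is only an illustration, not a substitute for a real test recovery: the function names, the growth rate and the sustained throughput figure are all assumptions you would replace with numbers measured in your own environment. It treats 1 GB as 1024 MB and assumes the transfer runs at a constant rate.

```python
def recovery_transfer_hours(data_gb, throughput_mb_per_s):
    """Estimate hours needed to move data_gb back to production.

    Assumes a constant sustained throughput (hypothetical figure that
    should come from a measured test recovery) and 1 GB = 1024 MB.
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = (data_gb * 1024) / throughput_mb_per_s
    return seconds / 3600


def projected_transfer_hours(data_gb, annual_growth, years, throughput_mb_per_s):
    """Estimate transfer time after the data set has grown for some years.

    annual_growth is a fraction (0.4 means 40% per year); compound growth
    is an assumption made for illustration.
    """
    grown_gb = data_gb * (1 + annual_growth) ** years
    return recovery_transfer_hours(grown_gb, throughput_mb_per_s)


if __name__ == "__main__":
    # Hypothetical application: 2 TB today, growing 40% a year,
    # restored at a sustained 200 MB/s.
    for year in range(4):
        hours = projected_transfer_hours(2048, 0.40, year, 200)
        print(f"year {year}: about {hours:.1f} hours to reposition the data")
```

Running the estimate on a schedule, with the current data set size plugged in, shows when the transfer time is about to exceed the recovery expectation, which is the point at which to revisit either the expectation or the recovery infrastructure.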
The best way to handle the transfer time may be to eliminate it, but most of the advances in data protection have focused on reducing the bandwidth that a backup needs. Capabilities like synthetic full, block-level incremental and deduplicated backup all eliminate much of the data that needs to be transferred as part of the backup process. While you may use all of these techniques to shrink the backup window, you typically cannot use them to help with the recovery effort. All of the data has to be moved back to the production device, and at full size. Even if the network connection were not the issue, there is also the concern about write speeds. Recovery is a write operation and backup is a read operation, and writes are slower than reads. To make matters worse, the primary storage device is likely using RAID 5 or RAID 6 for protection. This means additional time is spent calculating the parity information and then writing that information as well. In short, recoveries are almost always slower than backups.
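The asymmetry can be put in rough numbers. The following is a simplified sketch under stated assumptions: the backup moves only the changed fraction of the data and is limited by the slower of the network and the source read speed, while the restore moves the full data set and is limited by the slower of the network and the target write speed. The function names and every figure in the example are hypothetical, and the model ignores parity overhead beyond whatever is already reflected in the write speed you supply.

```python
def transfer_hours(data_gb, throughput_mb_per_s):
    """Hours to move data_gb at a constant rate (1 GB = 1024 MB)."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return (data_gb * 1024) / throughput_mb_per_s / 3600


def incremental_backup_hours(data_gb, change_rate, network_mb_s, source_read_mb_s):
    """Backup moves only the changed fraction and reads from production."""
    moved_gb = data_gb * change_rate
    return transfer_hours(moved_gb, min(network_mb_s, source_read_mb_s))


def full_restore_hours(data_gb, network_mb_s, target_write_mb_s):
    """Restore moves the whole data set and writes to production."""
    return transfer_hours(data_gb, min(network_mb_s, target_write_mb_s))


if __name__ == "__main__":
    # Hypothetical figures: 4 TB data set, 5% daily change,
    # 400 MB/s network, 600 MB/s reads, 250 MB/s writes to the RAID set.
    backup = incremental_backup_hours(4096, 0.05, 400, 600)
    restore = full_restore_hours(4096, 400, 250)
    print(f"nightly backup: about {backup:.2f} hours")
    print(f"full restore:   about {restore:.2f} hours")
```

With figures like these the nightly backup finishes in minutes while the restore takes hours, which is why a comfortable backup window says little about how long a recovery will take.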
There are vendors that are trying to address the transfer problem, though. For example, HA solutions that write data to a second, independent disk target, as we discuss in "Achieving Application Aware Availability", can perform a transfer-less recovery. This means either the application can switch to the second data set or it can be started on another host that has the data. In both cases recovery time comes down to the time it takes to boot the server. Alternatively, as we discuss in "Virtualization Powered Recovery", solutions are emerging that leverage the virtual environment to boot a virtual machine directly from the backup itself. This brings the application back into production while the data is migrated back to the primary storage device.
Reducing the time it takes to reposition data is critical to returning applications and services to users. Pay close attention to the transfer time not only when the application is initially deployed but also as it matures. For applications that, because of either criticality or the size of the data set, can't be repositioned in an acceptable timeframe, look for solutions that can eliminate the transfer process altogether.
Track us on Twitter: http://twitter.com/storageswiss
George Crump is lead analyst of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments. Find Storage Switzerland's disclosure statement here.