A glitch in Google's Gmail caused roughly 30,000 users to lose email, chat and contacts from their Gmail accounts. The cause appears to be a bug in a software update. The current piling on by some storage vendors is humorous. As my mother used to say, "people in glass houses shouldn't throw stones." Instead of doing that, let's learn from this experience so we can keep this from happening in your data center.
The first lesson to be learned is that storage systems and applications can have bugs introduced into them. Think about what we ask a storage system to do today: snapshots, thin provisioning, replication, automated tiering and deduplication, just to name a few. The reality is that today's storage system is far more complicated than it has ever been, and it will increase in complexity as we move into the future. Complexity is one of the key reasons that things fail. Those failures lead to data corruption or loss.
Your storage system has CPUs, and those CPUs run an application that provides the above data services. You could even say that it has what looks like a database. Most storage systems track blocks or inodes to help them manage LUNs, snapshot histories and deduplication metadata. The cataloging of this information means that the storage controller is essentially creating a database. Databases can become corrupted. The chances of this corruption or a storage controller failure may be 0.02%, but if it happens, you won't care what the odds were. You will be concerned about how to restore the application to the users.
Total protection means a mix of approaches. While instant replication seems to be at the root of the Google problem, you don't want to throw this valuable tool out. Realtime replication can provide you with a quick recovery point if an application or physical server fails. Its weakness is that if you are replicating in realtime, any data corruption is also replicated in realtime. Along with realtime replication you need something that is "semi-realtime" so that corruption won't be instantly replicated throughout the environment. This means snapshots of primary data and snapshots of secondary replicated data, or a software tool that can replicate for you and store data at intervals, not just in realtime. Many storage software solutions provide both of these capabilities; you just have to make sure you use them.
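
The difference between the two replication styles can be sketched with a toy model. This is only an illustration under simple assumptions, not how any particular storage product works: the `ProtectedVolume` class, its `interval` parameter and the block contents are all hypothetical names invented for the example. A volume is a dictionary of blocks, the realtime replica mirrors every write, and a point-in-time copy is taken every few writes.

```python
class ProtectedVolume:
    """Toy volume with a realtime mirror and periodic point-in-time copies.

    Hypothetical model for illustration only; real replication works on
    blocks, journals or files and is far more involved.
    """

    def __init__(self, interval):
        self.primary = {}            # block number -> contents
        self.realtime_replica = {}   # mirrored on every write
        self.interval_copies = []    # list of (write_count, copy of primary)
        self.interval = interval     # take a copy every `interval` writes
        self.write_count = 0

    def write(self, block, data):
        self.primary[block] = data
        # Realtime replication cannot tell good data from bad: it mirrors both.
        self.realtime_replica[block] = data
        self.write_count += 1
        if self.write_count % self.interval == 0:
            self.interval_copies.append((self.write_count, dict(self.primary)))

    def latest_copy(self):
        """Return the most recent interval copy, or None if none exists yet."""
        return self.interval_copies[-1][1] if self.interval_copies else None


vol = ProtectedVolume(interval=2)
vol.write(0, "mail-A")
vol.write(1, "mail-B")      # second write: an interval copy is taken here
vol.write(0, "CORRUPT")     # a buggy update damages block 0

print(vol.realtime_replica[0])   # the mirror received the damaged block too
print(vol.latest_copy()[0])      # the interval copy predates the damage
```

In this sketch the realtime replica is current but carries the corruption, while the interval copy is older but clean, which is why the two are used together.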
Snapshots essentially protect you from data corruption by giving you the ability to turn back time to get to older, uncorrupted data. The problem with snapshots, though, is that in most cases the snapshot is totally dependent on the original volume staying intact. A snapshot is not going to protect you from a physical storage failure. If the physical volume that contains the original data set fails, in most cases all the snapshot data will fail along with it. This type of failure can occur for a number of reasons, including double (or triple) drive RAID failures and storage controller failures. In most cases snapshots also will not provide you with an immediate recovery point after an application failure, as they are usually timed events that occur once an hour or so.
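
Why a snapshot usually depends on its original volume can be shown with a minimal copy-on-write sketch. It assumes a simplified design in which the snapshot keeps only the old contents of blocks that changed after it was taken, and keeps them on the same physical storage as the volume. The `Volume`, `Snapshot` and `VolumeFailed` names are hypothetical, and real snapshot implementations vary by vendor.

```python
class VolumeFailed(Exception):
    """Raised when the underlying physical storage is unavailable."""


class Snapshot:
    """Copy-on-write snapshot: stores only blocks changed since it was taken."""

    def __init__(self, base):
        self.base = base
        self.saved = {}   # block number -> contents at snapshot time

    def preserve(self, block, old_data):
        # Keep the first pre-change version only; later writes must not replace it.
        if block not in self.saved:
            self.saved[block] = old_data

    def read(self, block):
        # Assumption: preserved blocks live on the same physical storage as
        # the base volume, so a failure there takes the snapshot down too.
        if self.base.failed:
            raise VolumeFailed("snapshot is unreadable without its base volume")
        if block in self.saved:
            return self.saved[block]
        return self.base.read(block)   # unchanged blocks are read from the base


class Volume:
    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.snapshots = []
        self.failed = False

    def snapshot(self):
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def read(self, block):
        if self.failed:
            raise VolumeFailed("volume is offline")
        return self.blocks[block]

    def write(self, block, data):
        if self.failed:
            raise VolumeFailed("volume is offline")
        for snap in self.snapshots:
            snap.preserve(block, self.blocks.get(block))
        self.blocks[block] = data


vol = Volume({0: "good-0", 1: "good-1"})
snap = vol.snapshot()
vol.write(0, "CORRUPT")        # corruption after the snapshot was taken
recovered = snap.read(0)       # the snapshot still returns the old block

vol.failed = True              # e.g. a double drive failure in the RAID group
try:
    snap.read(0)
    snapshot_survived = True
except VolumeFailed:
    snapshot_survived = False
```

The snapshot turns back time for the corrupted block, but once the base volume is marked as failed nothing in the snapshot can be read, which is the gap a copy on a second, independent system is meant to close.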
In reality you need your storage controller or third-party storage application to provide both interval replication and realtime replication, ideally to a second, independent storage system. This gives you the ability to recover near instantly in the event of an application failure, but also the ability to pull up an older version of the data set in the case of data corruption, even if the primary storage system has completely failed. As we discussed in our recent article "Achieving Application Aware Availability," there should also be some level of application awareness in the data recovery plan that will protect against an application freeze or even a performance degradation.
This combination of realtime replication plus near-realtime replication gives you complete protection from application failure, data corruption and storage system failure. Assuming you are additionally replicating another copy off-site, it also gives you site failure protection. The one thing the application-aware and replication combination does not give you is historical recovery. That is something that tape is very well suited for.
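
One way to check a protection plan against this reasoning is to write the coverage down and look for gaps. The mapping below is only one reading of the discussion above, not a vendor specification, and every name in it (`COVERAGE`, `ALL_FAILURES`, `uncovered`) is made up for the sketch. It assumes both replication styles target a second, independent storage system.

```python
# Which failure types each technique is assumed to cover, per the text above.
COVERAGE = {
    "realtime_replication": {"application_failure", "storage_system_failure"},
    "interval_replication": {"data_corruption", "storage_system_failure"},
    "local_snapshots": {"data_corruption"},
    "offsite_replication": {"site_failure"},
    "tape_backup": {"historical_recovery"},
}

ALL_FAILURES = {
    "application_failure",
    "data_corruption",
    "storage_system_failure",
    "site_failure",
    "historical_recovery",
}


def uncovered(techniques):
    """Return the failure types that none of the given techniques cover."""
    covered = set()
    for technique in techniques:
        covered |= COVERAGE[technique]
    return ALL_FAILURES - covered


# Realtime plus interval replication alone leaves two gaps.
gaps = uncovered(["realtime_replication", "interval_replication"])
print(sorted(gaps))

# Adding an off-site copy and tape closes them.
print(uncovered(COVERAGE.keys()))
```

A table like this is also a natural place to record which technique is the first choice for each failure type once recovery priorities are documented.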
It seems like tape backups will eventually save the day for Google. Replication and snapshots add a lot of value. Disk-to-disk backup makes sense in today's data centers. At some point, though, it seems like a good idea to have a copy of data on a totally different form of storage, like tape. It also makes sense to have something that can be inexpensively shelved as a historical copy, just in case. Tape, in the Google case, is being criticized for being slow, but that may be an operational issue, not a technological one. Thanks to tape the Gmail users will, in theory, eventually get their data back. While not ideal, eventually getting your data back is better than never getting your data back.
Once this variety of data protection techniques is available, you need to clearly document which data will be recovered first after a failure. We will cover this prioritization of recovery in an upcoming entry.
George Crump is lead analyst of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments. Find Storage Switzerland's disclosure statement here.