
News | Commentary
3/2/2011 09:30 PM
George Crump

What We Can Learn From The Gmail Crash

Google's Gmail had a glitch introduced that caused roughly 30,000 users to lose email, chat, and contacts from their Gmail accounts. The cause appears to be a bug in a software update. The current piling on by some storage vendors is humorous. As my mother used to say, "people in glass houses shouldn't throw stones." Instead of doing that, let's learn from this experience so we can keep it from happening in your data center.

The first lesson is that storage systems and applications can have bugs introduced into them. Think about what we ask a storage system to do today: snapshots, thin provisioning, replication, automated tiering, and deduplication, just to name a few. The reality is that today's storage system is far more complicated than it has ever been, and it will only grow in complexity as we move into the future. Complexity is one of the key reasons that things fail, and those failures lead to data corruption or loss.

Your storage system has CPUs, and those CPUs run an application that provides the above data services. You could even say that it has what looks like a database. Most storage systems track blocks or inodes to help them manage LUNs, snapshot histories, and deduplication metadata. The cataloging of this information means that the storage controller is essentially creating a database, and databases can become corrupted. The chance of this corruption or a storage controller failure may be 0.02%, but if it happens, you won't care what the odds were. You will be concerned with how to restore the application to the users.
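To make the "controller as database" idea concrete, here is a toy sketch, purely illustrative and not how any vendor's controller is actually implemented, of a block map that records a checksum per entry. Corruption in either the data or the map shows up as a checksum mismatch:

```python
import hashlib

# Toy sketch: a storage controller's block map is effectively a small
# database mapping logical blocks to physical locations. A checksum per
# entry lets us detect when data read back no longer matches what was
# written. All names here are hypothetical.
class BlockMap:
    def __init__(self):
        self.entries = {}  # logical block -> (physical block, checksum)

    def write(self, logical, physical, data):
        digest = hashlib.sha256(data).hexdigest()
        self.entries[logical] = (physical, digest)

    def verify(self, logical, data):
        """Return True if the stored checksum matches the data read back."""
        _, digest = self.entries[logical]
        return hashlib.sha256(data).hexdigest() == digest

bm = BlockMap()
bm.write(0, 42, b"payload")
print(bm.verify(0, b"payload"))    # True: data intact
print(bm.verify(0, b"corrupted"))  # False: corruption detected
```

The point of the sketch is simply that once a controller keeps this kind of bookkeeping, it inherits all the failure modes of a database, including silent corruption of its own metadata.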

Total protection means a mix of approaches. While instant replication seems to be at the root of the Google problem, you don't want to throw this valuable tool out. Realtime replication can provide you with a quick recovery point if an application or physical server fails. Its weakness is that if you are replicating in realtime, any data corruption is also replicated in realtime. Along with realtime replication you need something that is "semi-realtime" so that corruption won't be instantly replicated throughout the environment. This means snapshots of primary data and snapshots of secondary replicated data, or a software tool that can replicate for you and store data at intervals, not just in realtime. Many storage software solutions provide both of these capabilities; you just have to make sure you use them.
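The difference between the two approaches can be shown in a few lines. This is a minimal, in-memory sketch under invented data, not a real replication engine: the realtime replica mirrors every write immediately, so a bad write propagates, while the interval snapshot taken before the bug fired stays clean.

```python
import copy

# Hypothetical data set standing in for a mailbox.
primary = {"inbox": ["mail-1", "mail-2"]}
realtime_replica = copy.deepcopy(primary)   # mirrors every write instantly
interval_snapshot = copy.deepcopy(primary)  # taken on a schedule, earlier

# A buggy software update wipes the primary; realtime replication
# faithfully copies the corruption the moment it happens.
primary["inbox"] = []
realtime_replica = copy.deepcopy(primary)

print(realtime_replica["inbox"])   # corruption replicated: []
print(interval_snapshot["inbox"])  # still recoverable: ['mail-1', 'mail-2']
```

This is exactly why the semi-realtime copy matters: it is the only one of the two that still holds the pre-corruption state.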

Snapshots essentially protect you from data corruption by giving you the ability to turn back time to get to older, uncorrupted data. The problem with snapshots, though, is that in most cases the snapshot is totally dependent on the original volume staying intact. A snapshot is not going to protect you from a physical storage failure. If the physical volume that contains the original data set fails, in most cases all of the snapshot data will fail along with it. This type of failure can occur for a number of reasons, including double (or triple) drive RAID failures and storage controller failures. Snapshots also will not, in most cases, provide you with an immediate recovery point after an application failure, since they are usually timed events that occur once an hour or so.

In reality, you need your storage controller or a third-party storage application to provide both interval replication and a realtime replication capability, ideally to a second, independent storage system. This gives you the ability to recover near instantly in the event of an application failure, but also the ability to pull up an older version of the data set in the case of data corruption, even if the primary storage system has completely failed. As we discussed in our recent article "Achieving Application Aware Availability," there should also be some level of application awareness in the data recovery plan that will protect against an application freeze or even a performance degradation.

This combination of realtime replication plus near-realtime replication gives you complete protection from application failure, data corruption, and storage system failure. Assuming you are additionally replicating another copy off-site, it also gives you site failure protection. The one thing the application-aware and replication combination does not give you is historical recovery. That is something that tape is very well suited for.

It seems that tape backups will eventually save the day for Google. Replication and snapshots add a lot of value, and disk-to-disk backup makes sense in today's data centers. At some point, though, it is a good idea to have a copy of data on a totally different form of storage, like tape. It also makes sense to have something that can be inexpensively and historically shelved, just in case. Tape, in the Google case, is being criticized for being slow, but that may be an operational issue rather than a technological one. Thanks to tape, the Gmail users will, in theory, eventually get their data back. While not ideal, eventually getting your data back is better than never getting it back.

Once this variety of data protection techniques is in place, you need to clearly document which data will be recovered first in the event of a failure. We will cover this prioritization of recovery in an upcoming entry.
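As a teaser for that entry, even a minimal recovery-priority document can be expressed as data. The application names and targets below are entirely made up for illustration; the idea is simply that the restore order should be written down and sortable, not decided in the middle of an outage.

```python
# Hypothetical recovery plan: lower tier restores first; ties broken by
# the tighter recovery-time target (in minutes).
recovery_plan = [
    {"app": "email",       "tier": 1, "target_minutes": 15},
    {"app": "erp",         "tier": 1, "target_minutes": 30},
    {"app": "file-shares", "tier": 2, "target_minutes": 240},
    {"app": "archives",    "tier": 3, "target_minutes": 1440},
]

def recovery_order(plan):
    """Return the plan sorted into the order applications get restored."""
    return sorted(plan, key=lambda e: (e["tier"], e["target_minutes"]))

for entry in recovery_order(recovery_plan):
    print(entry["app"])
```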

Track us on Twitter: http://twitter.com/storageswiss

Subscribe to our RSS feed.

George Crump is lead analyst of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments. Find Storage Switzerland's disclosure statement here.
