Dark Reading is part of the Informa Tech Division of Informa PLC

09:30 PM
George Crump

What We Can Learn From The Gmail Crash

A glitch in Google's Gmail caused roughly 30,000 users to lose email, chat and contacts from their Gmail accounts. The cause appears to be a bug in a software update. The current piling on by some storage vendors is humorous. As my mother used to say, "people in glass houses shouldn't throw stones." Instead of doing that, let's learn from this experience so we can keep this from happening in your data center.

The first lesson to be learned is that storage systems and applications can have bugs introduced into them. Think about what we ask a storage system to do today: snapshots, thin provisioning, replication, automated tiering and deduplication, just to name a few. The reality is that today's storage system is far more complicated than it has ever been, and it will only increase in complexity as we move into the future. Complexity is one of the key reasons that things fail. Those failures lead to data corruption or loss.

Your storage system has CPUs, and those CPUs run an application that provides the above data services. You could even say that it has what looks like a database. Most storage systems track blocks or inodes to help them manage LUNs, snapshot histories and deduplication metadata. The cataloging of this information means that the storage controller is essentially creating a database. Databases can become corrupted. The chances of this corruption or a storage controller failure may be 0.02%, but if it happens, you won't care what the odds were. You will be concerned about how to restore the application to the users.
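To put small odds in perspective, here is a minimal sketch of how they add up. It assumes, purely for illustration, that the 0.02% figure applies independently to each system (or each time period) you run; the function name and the counts of 100 and 1,000 are hypothetical and not taken from any vendor data.

```python
def chance_of_at_least_one_failure(p_single: float, n_trials: int) -> float:
    """Probability that at least one of n independent trials fails.

    Assumes every trial fails independently with the same probability,
    which real storage systems will only approximate.
    """
    return 1.0 - (1.0 - p_single) ** n_trials


if __name__ == "__main__":
    p = 0.0002  # the 0.02% figure, expressed as a fraction
    for n in (1, 100, 1000):  # hypothetical numbers of systems or periods
        print(n, chance_of_at_least_one_failure(p, n))
```

Under those assumptions the chance of seeing at least one failure is roughly 2% across 100 trials and roughly 18% across 1,000, which is why a recovery plan matters even when the per-system odds look tiny.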

Total protection means a mix of approaches. While instant replication seems to be at the root of the Google problem, you don't want to throw this valuable tool out. Realtime replication can provide you with a quick recovery point if an application or physical server fails. Its weakness is that if you are replicating in realtime, any data corruption is also replicated in realtime. Along with realtime replication you need something that is "semi-realtime" so that corruption won't be instantly replicated throughout the environment. This means snapshots of primary data and snapshots of secondary replicated data, or a software tool that can replicate for you and store data at intervals, not just in realtime. Many storage software solutions provide both of these capabilities; you just have to make sure you use them.
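The difference between the two approaches can be sketched in a few lines. This is a toy model, not how any particular storage product works: the class `ReplicatedStore`, its methods and the `is_good` check are all hypothetical names, and a dictionary stands in for a volume.

```python
import copy


class ReplicatedStore:
    """Toy volume with a realtime mirror plus interval (point-in-time) copies."""

    def __init__(self, max_interval_copies=3):
        # Assumes max_interval_copies >= 1.
        self.primary = {}
        self.mirror = {}           # realtime replica, updated on every write
        self.interval_copies = []  # point-in-time copies, oldest first
        self.max_interval_copies = max_interval_copies

    def write(self, key, value):
        self.primary[key] = value
        # The mirror copies good and bad writes alike, immediately.
        self.mirror[key] = value

    def take_interval_copy(self):
        self.interval_copies.append(copy.deepcopy(self.primary))
        # Keep only the newest copies, dropping the oldest beyond the limit.
        del self.interval_copies[:-self.max_interval_copies]

    def restore_latest_good(self, is_good):
        """Roll back to the newest interval copy that passes is_good."""
        for snap in reversed(self.interval_copies):
            if is_good(snap):
                self.primary = copy.deepcopy(snap)
                self.mirror = copy.deepcopy(snap)
                return True
        return False


if __name__ == "__main__":
    store = ReplicatedStore()
    store.write("inbox", "42 messages")
    store.take_interval_copy()
    store.write("inbox", "")  # simulated bad update wipes the data
    print(repr(store.mirror["inbox"]))  # the mirror is wiped too
    store.restore_latest_good(lambda snap: bool(snap.get("inbox")))
    print(repr(store.primary["inbox"]))
```

The mirror follows the bad write instantly, while the interval copy still holds the older, uncorrupted state to roll back to.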

Snapshots essentially protect you from data corruption by giving you the ability to turn back time to get to older, uncorrupted data. The problem with snapshots, though, is that in most cases the snapshot is totally dependent on the original volume staying intact. A snapshot is not going to protect you from a physical storage failure. If the physical volume that contains the original data set fails, in most cases all the snapshot data will fail along with it. This type of failure can occur for a number of reasons, including double (or triple) drive RAID failures and storage controller failures. In most cases snapshots also will not provide you with an immediate recovery point if you have an application failure, because they are usually timed events that occur once an hour or so.
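Why a snapshot depends on its original volume is easier to see in a small model. The sketch below assumes a copy-on-write design, which is one common way to implement snapshots but not the only one, and it assumes the preserved blocks sit on the same physical storage as the volume. The `Volume` and `Snapshot` classes are hypothetical.

```python
class Volume:
    """Toy block volume that supports copy-on-write snapshots."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.failed = False
        self.snapshots = []

    def read(self, index):
        if self.failed:
            raise OSError("volume has failed")
        return self.blocks[index]

    def write(self, index, data):
        if self.failed:
            raise OSError("volume has failed")
        # Copy-on-write: before overwriting a block, preserve its old
        # contents in every snapshot that has not saved that block yet.
        for snap in self.snapshots:
            snap.saved.setdefault(index, self.blocks[index])
        self.blocks[index] = data

    def snapshot(self):
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap


class Snapshot:
    """Holds only the blocks that changed after the snapshot was taken."""

    def __init__(self, volume):
        self.volume = volume
        self.saved = {}

    def read(self, index):
        # Assumption: preserved blocks share the volume's physical storage,
        # so a failed volume takes the snapshot down with it.
        if self.volume.failed:
            raise OSError("snapshot unavailable: underlying volume has failed")
        if index in self.saved:
            return self.saved[index]
        # Unchanged blocks are still read from the original volume.
        return self.volume.read(index)
```

A snapshot taken before a bad write can still return the old block, but once `failed` is set on the volume every snapshot read raises an error, which is the dependency described above.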

In reality you need your storage controller or third-party storage application to provide both interval replication and a realtime replication capability, ideally to a second independent storage system. This gives you the ability to recover near instantly in the event of an application failure, and also the ability to pull up an older version of the data set in the case of data corruption, even if the primary storage system has completely failed. As we discussed in our recent article "Achieving Application Aware Availability," there should also be some level of application awareness in the data recovery plan that will protect against an application freeze or even a performance degradation.

This combination of realtime replication plus near-realtime replication gives you complete protection from application failure, data corruption and storage system failure. Assuming you are additionally replicating another copy off-site, it also gives you site failure protection. The one thing the application-aware and replication combination does not give you is historical recovery. That is something that tape is very well suited for.
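One way to keep this mix straight is to write down which copy you would reach for in each situation. The lookup below is only a sketch of the pairing described in this article; the enum, the labels and the function name are hypothetical, and a real plan would point at actual systems and runbooks.

```python
from enum import Enum


class Failure(Enum):
    APPLICATION = "application failure"
    CORRUPTION = "data corruption"
    STORAGE_SYSTEM = "storage system failure"
    SITE = "site failure"
    HISTORICAL = "historical recovery"


# Which copy to recover from, following the pairing in the text above.
RECOVERY_SOURCE = {
    Failure.APPLICATION: "realtime replica",
    Failure.CORRUPTION: "interval copy on the second storage system",
    Failure.STORAGE_SYSTEM: "replica on the second storage system",
    Failure.SITE: "off-site replica",
    Failure.HISTORICAL: "tape",
}


def recovery_source(failure: Failure) -> str:
    """Return the copy to recover from for a given kind of failure."""
    return RECOVERY_SOURCE[failure]


if __name__ == "__main__":
    for failure in Failure:
        print(f"{failure.value}: {recovery_source(failure)}")
```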

It seems like tape backups will eventually save the day for Google. Replication and snapshots add a lot of value. Disk-to-disk backup makes sense in today's data centers. At some point, though, it seems like a good idea to have a copy of data on a totally different form of storage, like tape. It also makes sense to have something that can be inexpensively and historically shelved, just in case. Tape, in the Google case, is being criticized for being slow, but that may be an operational issue, not a technological one. Thanks to tape the Gmail users will, in theory, eventually get their data back. While not ideal, eventually getting your data back is better than never getting your data back.

Once this variety of data protection techniques is available, you need to clearly document which data will be recovered first after a failure. We will cover this prioritization of recovery in an upcoming entry.

Track us on Twitter: http://twitter.com/storageswiss

George Crump is lead analyst of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments. Find Storage Switzerland's disclosure statement here.
