Dark Reading is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them.Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


09:30 PM
George Crump
George Crump

What We Can Learn From The Gmail Crash

Google's Gmail had a glitch introduced that caused 30,000 users or so to loose email, chat and contacts from their Gmail accounts. The cause appears to be a bug in a software update. The current piling on by some storage vendors is humorous. As my mother used to say "people in glass houses shouldn't throw stones". Instead of doing that, lets learn from this experience so we can keep this from

Google's Gmail had a glitch introduced that caused 30,000 users or so to loose email, chat and contacts from their Gmail accounts. The cause appears to be a bug in a software update. The current piling on by some storage vendors is humorous. As my mother used to say "people in glass houses shouldn't throw stones". Instead of doing that, lets learn from this experience so we can keep this from happening in your data center.The first lesson to be learned is that storage systems and applications can have bugs introduced into them. Think about what we ask a storage system to do today; snapshots, thin provisioning, replication, automated tiering and deduplication, just to name a few. The reality is today's storage system is far more complicated than it has ever been and will increase in complexity as we move into the future. Complexity is one of the key reasons that things fail. Those failures lead to data corruption or loss.

Your storage system has CPUs and those CPUs run an application that provides the above data services. You could even say that it has what looks like a database. Most storage systems track blocks or inodes to help them manage LUNS, snapshot histories and deduplication meta-data. The cataloging of this information means that the storage controller essentially is creating a database. Databases can corrupt. The chances of having this corruption or a storage controller failure may be .02% but if that happens, you won't care what the odds were. You will be concerned about how to restore the application to the users.

Total protection means a mix of approaches. While instant replication seems to be at the root of the Google problem you don't want to throw this valuable tool out. Realtime replication can provide you with a quick recovery point if an application or physical server fails. Its weakness is that if your are replicating in realtime any data corruption is also replicated in realtime. Along with realtime replication you need something that is "semi-realtime" so that corruption won't be instantly replicated throughout the environment. This means snapshots of primary data and snapshots of secondary replicated data or a software tool that can replicate for you and store data at intervals, not just realtime. Many storage software solutions provide both of these capabilities, you just have to make sure you use it.

Snapshots essentially protect you from data corruption by giving you the ability to turn back time to get to older uncorrupted data. The problem with snapshots though is that in most cases the snapshot is totally dependent on the original volume staying intact. A snapshot is not going to protect you from a physical storage failure. If the physical volume that contains the original data set fails, in most cases, all the snapshot data will fail along with it. This type of failure can occur for a number of reasons, including double (or triple) drive RAID failures and storage controller failures. Snapshots also will not provide you, in most cases, with an immediate recovery point, if you have an application failure as they are usually timed events that occur once an hour or so.

In reality you need your storage controller or third party storage application to provide both interval replication plus a realtime replication capability. Ideally to a second independent storage system. This gives you the ability to recover near instantly in the event of an application failure but also the ability to pull up an older version of the data set in the case of data corruption even if the primary storage system has completely failed. As we discussed in our recent article "Achieving Application Aware Availability" there should also be some level of application awareness in the data recovery plan that will protect against an application freeze or even a performance degradation.

This combination of realtime replication plus near realtime replication gives you complete protection from application failure, data corruption and storage system failure. Assuming you are additionally replicating another copy off-site it also gives you site failure protection. The one thing the application aware and replication combination does not give you is historical recovery. That is something that tape is very well suited for.

It seems like tape backups will eventually save the day for Google. Replication and snapshots add a lot of value. Disk to Disk backup makes sense in today's data centers. At some point though it seems like a good idea to have a copy of data on a totally different form of storage like tape. It also makes sense to have something that can be inexpensively and historically shelved, just in case. Tape, in the Google case, is being criticized for being slow but that may be another operational issue not a technological issue. Thanks to tape the Gmail users will, in theory, eventually get their data back. While not ideal, eventually getting your data back is better than never getting your data back.

Once this variety of data protection techniques are available then the you need to clearly document which data will be sought out first given a failure. We will cover this prioritization of recovery in an upcoming entry.

Track us on Twitter: http://twitter.com/storageswiss

Subscribe to our RSS feed.

George Crump is lead analyst of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments. Find Storage Switzerland's disclosure statement here.

Comment  | 
Print  | 
More Insights
Threaded  |  Newest First  |  Oldest First
COVID-19: Latest Security News & Commentary
Dark Reading Staff 10/27/2020
Modern Day Insider Threat: Network Bugs That Are Stealing Your Data
David Pearson, Principal Threat Researcher,  10/21/2020
Are You One COVID-19 Test Away From a Cybersecurity Disaster?
Alan Brill, Senior Managing Director, Cyber Risk Practice, Kroll,  10/21/2020
Register for Dark Reading Newsletters
White Papers
Current Issue
Special Report: Computing's New Normal
This special report examines how IT security organizations have adapted to the "new normal" of computing and what the long-term effects will be. Read it and get a unique set of perspectives on issues ranging from new threats & vulnerabilities as a result of remote working to how enterprise security strategy will be affected long term.
Flash Poll
How IT Security Organizations are Attacking the Cybersecurity Problem
How IT Security Organizations are Attacking the Cybersecurity Problem
The COVID-19 pandemic turned the world -- and enterprise computing -- on end. Here's a look at how cybersecurity teams are retrenching their defense strategies, rebuilding their teams, and selecting new technologies to stop the oncoming rise of online attacks.
Twitter Feed
Dark Reading - Bug Report
Bug Report
Enterprise Vulnerabilities
From DHS/US-CERT's National Vulnerability Database
PUBLISHED: 2020-10-28
An Arbitrary File Upload in the Upload Image component in SourceCodester Car Rental Management System 1.0 allows the user to conduct remote code execution via admin/index.php?page=manage_car because .php files can be uploaded to admin/assets/uploads/ (under the web root).
PUBLISHED: 2020-10-28
The RandomGameUnit extension for MediaWiki through 1.35 was not properly escaping various title-related data. When certain varieties of games were created within MediaWiki, their names or titles could be manipulated to generate stored XSS within the RandomGameUnit extension.
PUBLISHED: 2020-10-27
The search functionality of the Greenmart theme 2.4.2 for WordPress is vulnerable to XSS.
PUBLISHED: 2020-10-27
This issue was addressed with improved checks to prevent unauthorized actions. This issue is fixed in Apple Music 3.4.0 for Android. A malicious application may be able to leak a user's credentials.
PUBLISHED: 2020-10-27
An access issue was addressed with improved access restrictions. This issue is fixed in macOS Catalina 10.15.3, Security Update 2020-001 Mojave, Security Update 2020-001 High Sierra. A malicious application may be able to overwrite arbitrary files.