News
6/10/2011
03:46 PM
George Crump
Commentary

Big Data A Big Backup Challenge

Backing up Big Data requires a system that is fast, cost-effective, and reliable. In the world of storage, those are conflicting requirements.

Big Data is, well, big, and size is not the only challenge it places on backup. It is also a backup application's worst nightmare, because many Big Data environments consist of millions or even billions of small files. How do you design a backup infrastructure that can support these Big Data realities?

First, examine what data does not have to be backed up at all because it can be easily regenerated from another system that is already being backed up. A good example is report data generated from a database.

Once this data is identified, exclude it. Next, move on to the real problem at hand--unique data that can't be re-created. This is often discrete file data that is fed into the environment via devices or sensors. It is essentially point-in-time data that can't be regenerated. This data is often copied within the Big Data environment so that it can be safely analyzed, so there can be a fair amount of redundancy in the environment. This is an ideal role for disk backup devices. They are better suited to small-file transfers and, with deduplication, can eliminate redundancy and compress much of the data to optimize backup capacity.
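To make the exclusion and deduplication steps concrete, here is a minimal Python sketch. The landing directory and exclusion patterns are hypothetical, not taken from any real product: regenerable data is skipped, and the remaining small files are fingerprinted by content hash so a deduplicating backup target only has to store one copy of each unique file.

import hashlib
import os
from fnmatch import fnmatch

EXCLUDE_PATTERNS = ["*/reports/*", "*.tmp"]   # regenerable data; already protected elsewhere
SOURCE_ROOT = "/bigdata/ingest"               # assumed sensor/device landing zone

def is_regenerable(path):
    return any(fnmatch(path, pattern) for pattern in EXCLUDE_PATTERNS)

def fingerprint(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

unique, redundant = {}, 0
for dirpath, _, filenames in os.walk(SOURCE_ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        if is_regenerable(path):
            continue                          # excluded: can be rebuilt from the source system
        key = fingerprint(path)
        if key in unique:
            redundant += 1                    # duplicate copy; a dedup target stores it once
        else:
            unique[key] = path

print(f"{len(unique)} unique files to protect, {redundant} redundant copies detected")

In practice the exclusion list is driven by your own knowledge of which data sets can be rebuilt, and the fingerprinting is handled inside the backup appliance itself; the sketch only illustrates the logic.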

Effective optimization is critical since Big Data environments are measured in the hundreds of terabytes and will soon be measured in dozens of petabytes. It is also important to consider just how far you want to extend disk backup's role in this environment.

Clearly deduplicated disk is needed, but it probably should be used in conjunction with tape--not as a replacement for it. Again, a large portion of this data often can't be regenerated. Loss of this data is permanent and potentially ruins the Big Data sample. You can't be too careful, and, at the same time, you have to control capacity costs so that the value of the decisions Big Data enables is not overshadowed by the expense of keeping the data that supports them. We suggest a Big Data backup strategy that includes a large tier of optimized backup disk to store the near-term data set for as long as possible--seven to 10 years' worth of data is ideal--and then uses tape for the decades' worth of less frequently accessed data.
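A minimal sketch of that age-based placement, assuming an illustrative ten-year cutoff and made-up tier names, might look like this:

from datetime import datetime, timedelta

DISK_RETENTION = timedelta(days=10 * 365)     # near-term window on optimized backup disk

def choose_tier(last_modified, now=None):
    """Return the backup tier for a file based on its age."""
    age = (now or datetime.now()) - last_modified
    return "dedup-disk" if age <= DISK_RETENTION else "tape"

# A file last touched twelve years ago falls out of the disk window and goes to tape.
print(choose_tier(datetime(2000, 1, 1), now=datetime(2012, 1, 1)))

A real policy engine would also weigh access frequency, not just age, but the cutoff captures the seven-to-10-year guideline above.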

Alternatively, you could go with the suggestion we made in a recent article, "Tape's Role in Big Data," and combine the two into a single active archive--essentially a single file system that seamlessly marries all of these media types. This would consist of fast but low-capacity (by Big Data standards) primary disk for data ingestion and active analytical processing, optimized disk for near-term data that is not being analyzed at the moment, and tape for long-term storage. In this environment, data can be sent to all tiers of storage as it is created or modified, so that fewer backups--or even none--need to be done.
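A rough sketch of that "land on every tier at ingest" idea, with assumed paths and tier names purely for illustration, could look like this:

import shutil
from pathlib import Path

TIERS = {
    "primary": Path("/tiers/primary"),        # fast disk for ingestion and active analytics
    "nearline": Path("/tiers/dedup-disk"),    # optimized disk for near-term, idle data
    "archive": Path("/tiers/tape-staging"),   # staging area later drained to tape
}

def ingest(source_file):
    """Copy a newly created file to every tier so no separate backup pass is needed."""
    source = Path(source_file)
    copies = []
    for root in TIERS.values():
        root.mkdir(parents=True, exist_ok=True)
        target = root / source.name
        shutil.copy2(source, target)          # data is protected the moment it lands
        copies.append(target)
    return copies

A real active archive performs this placement inside the file system itself; the sketch only shows why a separate backup job becomes largely unnecessary.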

Big Data is a big storage challenge: not only must the data be stored, it must be put on a platform fast enough to run meaningful analytics while, at the same time, being cost-effective and reliable. These are conflicting requirements in the world of storage. Resolving that conflict is going to require a new way of doing things.

Follow Storage Switzerland on Twitter

George Crump is lead analyst of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments. Storage Switzerland's disclosure statement.
