
Commentary
George Crump
4/20/2012 | 01:29 PM

How To Protect Big Data Analytics

Big data analytics often means big challenges when it comes to data protection. Here are some things to keep in mind when you're working in these environments.

Data protection is often the forgotten part of any data center trend, and big data initiatives are no exception: protection is too often an afterthought. What makes big data, and big data analytics in particular, especially challenging, as I discussed in a recent column, is that it is the perfect storm for a data protection disaster.

Big data analytics has all the things you don't want to see when you're trying to protect data. First, it can have a unique sample set--for example, a device that monitors a soil sample every 30 seconds, a camera that takes thousands of images every minute, or a cell phone call center that logs millions of text messages. All that data is unique to that moment; if it is lost, it is impossible to recreate.
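
To put rough numbers on those examples, here is a back-of-the-envelope sketch in Python. Every record size and rate below is an assumption chosen purely for illustration, not a measurement from the scenarios above.

    # Rough daily volumes for the example sources; all sizes and rates are assumptions.
    SECONDS_PER_DAY = 86_400

    sensor_bytes = (SECONDS_PER_DAY / 30) * 1_024    # one assumed 1 KB reading every 30 s
    camera_bytes = 2_000 * 60 * 24 * 500 * 1_024     # assumed 2,000 images/min at 500 KB each
    sms_bytes    = 5_000_000 * 1_024                 # assumed 5M logged messages/day at 1 KB each

    for name, size in [("sensor", sensor_bytes),
                       ("camera", camera_bytes),
                       ("sms log", sms_bytes)]:
        print(f"{name:8s}: {size / 1_024**3:,.2f} GiB/day")

The point is less the exact figures than that each stream is unrepeatable: once a day of it is lost, no re-run can produce it again.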

That uniqueness also means that the data is probably not deduplicatable. As I discussed in a recent article, you may need to either turn off deduplication, or at least factor in a very low effective rate, in such environments. This means the capacity of the backup appliance may have to be much closer to the size of the real data set than in other backup situations, where you may be counting on a high level of dedupe effectiveness.
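
A minimal capacity-sizing sketch makes the difference concrete. It assumes you retain some number of full-equivalent copies and know an effective dedupe ratio; the 10:1 and 1.2:1 ratios below are illustrative assumptions, not vendor figures.

    def backup_capacity_tib(dataset_tib, retained_fulls, dedupe_ratio):
        # Logical data is the data set times the copies retained;
        # the effective dedupe ratio divides that down to physical capacity.
        return dataset_tib * retained_fulls / dedupe_ratio

    print(backup_capacity_tib(500, 4, 10.0))  # highly redundant data: ~200 TiB
    print(backup_capacity_tib(500, 4, 1.2))   # unique analytic data: ~1,667 TiB

With unique analytic data, the appliance ends up holding nearly the full logical footprint rather than a small deduplicated fraction of it.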

[ Bigger data sets mean bigger compliance challenges. Read more at Big Data's Dark Side: Compliance Issues. ]

The sheer number of files that can be resident in big data analytic environments is also a challenge. For the backup application and the appliance to churn through that many files, the bandwidth to the backup server and/or the backup appliance needs to be large, and the receiving devices must be able to ingest data at the rate it can be delivered. They also need significant CPU processing power to work through billions of files.
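
As a toy model of why file count matters as much as raw bandwidth, assume a fixed per-file handling cost plus a streaming cost per byte; the 0.5 ms per file and 1 GiB/s figures below are assumptions for illustration only.

    def backup_hours(n_files, avg_file_bytes, per_file_s, stream_gib_per_s):
        # Fixed metadata/handling cost per file, plus bytes over stream bandwidth.
        metadata_s = n_files * per_file_s
        transfer_s = n_files * avg_file_bytes / (stream_gib_per_s * 1024**3)
        return (metadata_s + transfer_s) / 3600

    # 2 billion 50 KB files: ~278 hours of per-file handling vs. ~27 hours of transfer
    print(backup_hours(2_000_000_000, 50 * 1024, 0.0005, 1.0))

At numbers like these, per-file overhead, not wire speed, is the bottleneck, which is why CPU and metadata handling matter as much as bandwidth.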

There's also a database component to big data that needs to be considered. Analytic information is often processed into an Oracle or Hadoop environment of some sort, so live protection of that environment may be required. This means a smaller number of larger files also needs to be backed up.

This is a worst-case mixed workload for high-performance backup: billions of small files alongside a small number of very large files, which may break many backup appliances. Finding one that can ingest this mixed workload at full speed, has a deduplication configuration that won't impact performance, and can scale to massive capacities may be the biggest challenge in the big data backup market. You may have to consider tape, and if so, the disk backup vendor needs to know how to work with it.
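
Running the large-file side of the workload through the same toy model shows why the mix is so awkward: the small files are bound by per-file overhead, while the database files are bound by raw throughput. The same assumed 0.5 ms per-file and 1 GiB/s figures apply.

    # 5,000 database files of 200 GiB each, same assumed costs as the sketch above
    n_files, file_bytes = 5_000, 200 * 1024**3
    metadata_hours = n_files * 0.0005 / 3600                # ~0.0007 h: negligible
    transfer_hours = n_files * file_bytes / 1024**3 / 3600  # ~278 h at 1 GiB/s: throughput-bound
    print(f"{metadata_hours:.4f} h metadata, {transfer_hours:.0f} h transfer")

An appliance tuned for one profile often stalls on the other, which is what makes a single device that handles both at full speed so hard to find.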

The other form of big data, big data archive, should be less of an issue if it's designed correctly. If the design uses tape as part of the archive, then backup can be built in as part of the workflow. Designing the storage infrastructure for big data archive environments will be the subject of an upcoming column.


George Crump is lead analyst of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments.

Comments
MarkNeilson, User Rank: Apprentice
2/23/2014 | 8:38:51 AM
re: How To Protect Big Data Analytics
Analytics is always essential to take care of the de-duplication problem. The analytics clearly explains the overall issues and the increased demand of data management in the upcoming years. If the data is bigger there will be compliance issues. So, it is quite important to take care of such data related issues along with its quality and management. You can also take the help of Data Cleansing software that will solve all your issues.
JTAYLOR9009, User Rank: Apprentice
4/20/2012 | 7:36:11 PM
re: How To Protect Big Data Analytics
Interesting article. RDBMS & NoSQL both have their places in the world. I like to think of them like tools; sometimes you need a sledgehammer and sometimes you need a chisel. Still equally important.