Big data analytics has all the things you don't want to see when you're trying to protect data. First, it can have a unique sample set: for example, a device that monitors a soil sample every 30 seconds, a camera that takes thousands of images every minute, or a cell phone call center that logs millions of text messages. All that data is unique to that moment; if it is lost, it is impossible to recreate.
That uniqueness also means that the data probably cannot be deduplicated effectively. As I discussed in a recent article, you may need to either turn off deduplication, or at least factor in a very low effective rate, in such environments. This means that the capacity of the backup appliance may have to be much closer to the size of the real data set than in other backup situations, where you may be counting on a high level of dedupe effectiveness.
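To make the sizing point concrete, here is a rough sketch of how the effective dedupe ratio drives appliance capacity. The function name, the 100 TB data set, the 10 retained copies, and both ratios are hypothetical values chosen only for illustration; real sizing depends on change rates, compression, and the vendor's own guidance.

```python
def required_capacity_tb(primary_tb: float, retained_copies: int, dedupe_ratio: float) -> float:
    """Estimate the physical backup capacity needed, in TB.

    dedupe_ratio is the effective reduction ratio (10.0 means 10:1,
    1.0 means no reduction at all). All inputs are illustrative assumptions.
    """
    if dedupe_ratio < 1.0:
        raise ValueError("dedupe_ratio must be at least 1.0")
    return primary_tb * retained_copies / dedupe_ratio


# Hypothetical 100 TB data set with 10 retained full copies.
conventional = required_capacity_tb(100, 10, 10.0)  # high dedupe effectiveness assumed
big_data = required_capacity_tb(100, 10, 1.2)       # mostly unique data, low effective rate

print(f"Conventional workload: {conventional:.0f} TB")
print(f"Big data workload:     {big_data:.0f} TB")
```

Under these assumed numbers, the same logical data needs roughly eight times more physical capacity once the dedupe ratio collapses.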
The large number of files that can be resident in big data analytic environments is also a challenge. For the backup application and the appliance to work through this many files, the bandwidth to the backup server and/or the backup appliance needs to be large, and the receiving devices must be able to ingest data as fast as it can be delivered. They also need significant CPU power to process billions of files.
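A back-of-the-envelope model shows why file count matters as much as raw bandwidth. The sketch below assumes a simple bottleneck model in which the backup takes as long as the slower of two limits: moving the bytes and processing the files. The function name and all figures (50 TB, one billion files, 1,000 MB/s, 10,000 files per second) are hypothetical, not measurements from any product.

```python
def backup_hours(total_tb: float, file_count: int,
                 throughput_mb_s: float, files_per_s: float) -> float:
    """Estimate backup duration in hours under a simple bottleneck model.

    The job is assumed to be limited by whichever is slower: transferring
    the bytes or processing the individual files. Uses decimal units
    (1 TB = 1,000,000 MB).
    """
    byte_seconds = total_tb * 1_000_000 / throughput_mb_s
    file_seconds = file_count / files_per_s
    return max(byte_seconds, file_seconds) / 3600


# Hypothetical environment: 50 TB spread across one billion small files.
hours = backup_hours(50, 1_000_000_000, throughput_mb_s=1000, files_per_s=10_000)
print(f"Estimated backup time: {hours:.1f} hours")
```

With these assumed figures the bytes alone would move in about 14 hours, but per-file processing stretches the job to nearly 28, which is why ingest rate and CPU matter alongside bandwidth.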
There's also a database component to big data that needs to be considered. Analytic information is often processed into either an Oracle or Hadoop environment of some sort, so live protection of that environment may be required. This means a smaller number of much larger files also needs to be backed up.
This is a worst-case mixed workload that also demands high performance: billions of small files alongside a small number of large files, a combination that may break many backup appliances. Finding one that can ingest this mixed workload at full speed, that has a deduplication configuration that won't impact performance, and that can scale to massive capacities may be the biggest challenge in the big data backup market. You may have to consider tape, and if so, the disk backup vendor needs to know how to work with it.
The other form of big data, big data archive, should be less of an issue if it's designed correctly. If the design uses tape as part of the archive, then backup can be built in as part of the workflow. Designing the storage infrastructure for big data archive environments will be the subject of an upcoming column.