10:28 AM
George Crump

Active Data Vs. Active Archive

We need better metrics to help us decide what data should be on primary storage and what should be on archive storage.

In my last column I discussed how what we used to consider active data is changing. We now have to look at the potential working set instead of the actual working set. Thanks to initiatives like real-time analytics, some data that we used to classify as archivable now needs to be at the ready. If this is the case, what is the role of archive? How do disk and tape archives participate in an increasingly active world?

The key to a balanced storage strategy, even with all this active data, is to change how we decide to archive a given set of data. Under the current archive methodology, the most common decision point is last modification date. In other words, data that is X days or years old can be archived; everything else has to stay on primary storage. The problem with this methodology is that it is not compatible with real-time analytics, and not really compatible with the way users use data either.

We need better metrics to help us decide what data should be on primary storage and what should be on archive storage. A key criterion is going to be which data, if it needs to be accessed, will need to be delivered instantly -- in other words, something that may need to be analyzed in the future. This data should probably not go to an archive no matter how old it gets, since it has some statistical probability of value.

However, if we know for sure that a certain data set will not be part of a real-time processing application or be needed for analytics, then let's archive it as soon as possible and not even wait for it to age. Maybe some of this data could even spend its entire lifecycle on archive storage because the performance of the archive is "good enough" for the use case.
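
To make the contrast with the age-only rule concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `DataSet` fields, the `analytics_candidate` flag (True, False, or unknown) and the one-year cutoff are hypothetical, not taken from any archive product.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataSet:
    name: str
    age_days: int
    # Hypothetical flag: True = may be analyzed later, False = known not to be,
    # None = nobody has classified this data yet.
    analytics_candidate: Optional[bool] = None

def place_by_age(ds: DataSet, max_age_days: int = 365) -> str:
    """The traditional rule: age alone decides."""
    return "archive" if ds.age_days > max_age_days else "primary"

def place_by_use(ds: DataSet, max_age_days: int = 365) -> str:
    """Sketch of the proposed rule: potential use decides first, age is the fallback."""
    if ds.analytics_candidate is True:
        return "primary"   # stays at the ready no matter how old it gets
    if ds.analytics_candidate is False:
        return "archive"   # archive right away, without waiting for it to age
    return place_by_age(ds, max_age_days)

old_sensor_logs = DataSet("sensor-logs", age_days=2000, analytics_candidate=True)
new_scans = DataSet("scanned-forms", age_days=1, analytics_candidate=False)
print(place_by_age(old_sensor_logs), place_by_use(old_sensor_logs))  # archive primary
print(place_by_age(new_scans), place_by_use(new_scans))              # primary archive
```

The two rules disagree in exactly the cases described above: old data that may still be analyzed stays on primary storage, and brand-new data with no analytic future goes straight to the archive.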

There is also the need to understand relationships between files. As a simple example, I am writing a couple of books right now. Each draft of those books gets a different file name, but large chunks of the content within those files are the same. When I get to the end of any of these books, I really don't think I will need all of these drafts. But because all data has become a "you never know" situation, I will want to keep all of them around, even though I doubt I will ever access them again.

The question is how many of these drafts I will require instant access to, and how many I could wait 10 minutes to view. For my purposes, all I really will need is the final copy and maybe a couple of the iterations. It would be nice to have software analyze this data, keep the versions of the files with the most significant internal changes, and archive the rest.
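
As a rough illustration of what such software might do, the sketch below scores each draft by how much it changed from the previous one and keeps the final draft plus the most-changed few. It uses Python's standard `difflib`; the function names and the `keep` parameter are hypothetical, and a real tool would likely compare documents in a more sophisticated way.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

def change_ratio(old: str, new: str) -> float:
    """Fraction of content that differs between two drafts (0.0 = identical)."""
    return 1.0 - SequenceMatcher(None, old, new, autojunk=False).ratio()

def split_drafts(drafts: List[str], keep: int = 2) -> Tuple[List[int], List[int]]:
    """Return (indexes to keep on fast storage, indexes to archive).

    The final draft is always kept. Of the earlier drafts, the `keep` that
    changed most relative to their predecessor are kept as well. The first
    draft is compared against empty content.
    """
    if not drafts:
        return [], []
    final = len(drafts) - 1
    scores = {
        i: change_ratio(drafts[i - 1] if i > 0 else "", drafts[i])
        for i in range(final)
    }
    top = sorted(scores, key=scores.get, reverse=True)[:keep]
    kept = sorted(set(top) | {final})
    archived = [i for i in range(len(drafts)) if i not in kept]
    return kept, archived

drafts = [
    "chapter one",
    "chapter one.",
    "chapter one. chapter two is entirely new text here",
    "chapter one. chapter two is entirely new text here!",
]
kept, archived = split_drafts(drafts, keep=2)
print(kept, archived)  # [0, 2, 3] [1]
```

Draft 1 only adds a period to draft 0, so it is the one sent to the archive, while the draft that introduced a whole new chapter stays close at hand.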

Interestingly, one of the things we are learning from our primary storage deduplication test is how big a role this technology can play in these circumstances. Essentially, I can keep all of the files with minimal impact on space utilization. And since the files can stay on disk, retrieval time is excellent.
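
To see why near-identical drafts cost so little after deduplication, consider this minimal sketch. It assumes simple fixed-size chunks identified by SHA-256 hashes; that is an illustrative simplification, not a description of how any particular product works, and `dedup_stats` is a hypothetical name.

```python
import hashlib
from typing import Iterable, Tuple

def dedup_stats(files: Iterable[bytes], chunk_size: int = 4096) -> Tuple[int, int]:
    """Return (logical_bytes, stored_bytes) under fixed-size chunk deduplication."""
    unique_chunks = {}  # chunk digest -> chunk length, one entry per unique chunk
    logical = 0
    for data in files:
        logical += len(data)
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            unique_chunks.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return logical, sum(unique_chunks.values())

# Two "drafts": the second repeats the first and appends one new chunk.
draft1 = b"A" * 4096 + b"B" * 4096
draft2 = draft1 + b"C" * 4096
print(dedup_stats([draft1, draft2]))  # (20480, 12288)
```

Both drafts remain fully retrievable, but the shared content is stored only once.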

Another classification point is how the data is acted on when it is recovered: from beginning to end, or starting at some random point in the file? Basically, can the data be utilized sequentially? If so, then just the front section of that data needs to be stored on primary storage, enough that it can start being accessed while the back end catches up and the users see no delay in response time. This capability will require a file system intelligent enough to deliver data from two different sources at the same time.
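
The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `read_head` and `recall_tail` are hypothetical callables standing in for a fast read from primary storage and a slow recall from the archive, and a real file system would do this at the block level rather than in application code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator

def read_sequential(read_head: Callable[[], bytes],
                    recall_tail: Callable[[], bytes]) -> Iterator[bytes]:
    """Yield a file's bytes in order, overlapping the slow recall with the fast read."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        tail = pool.submit(recall_tail)  # start the archive recall right away
        yield read_head()                # the reader gets the front section immediately
        yield tail.result()              # waits only if the recall has not finished yet

# Stand-ins for the two storage tiers.
data = b"".join(read_sequential(lambda: b"front-", lambda: b"back"))
print(data)  # b'front-back'
```

If the front section is large enough to keep the reader busy for as long as the recall takes, the reader never notices that the rest of the file came from somewhere slower.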

When these attributes of the data are known and understood, it can be placed on the right type of storage in the data center. Data whose recovery need is random and unpredictable will need to go on fast storage if analytics are being used. Data that is very similar to other data can be archived or deduplicated.

This archive, depending on the known recovery need, can easily be tape-based, because for a large chunk of the data set how quickly it is recovered is less important than how cost-effectively it can be stored.

re: Active Data Vs. Active Archive
Hi George - very interesting article (as always). I am a board member of the Active Archive Alliance and SVP Sales for QStar. My main comments are with this paragraph.

"We need better metrics to help us decide what data should be on primary storage and what should be on archive storage. A key criteria is going to be what data, if it needs to be accessed, will need to be delivered instantly -- in other words, something that may need to be analyzed in the future. This data should probably not go to an archive no matter how old it gets since it could have a statistical probability of value".

This assumes that archive storage is slow and primary storage is fast, which is not necessarily correct. Active Archive solutions can use tape but can also use disk or object storage, which is not slow. Accessibility and instant delivery can be provided by Object Storage solutions as an active archive. The key point is getting data away from the primary storage environment, and the constant backup regime that is associated with it, once data is no longer changing. Archives secure data through copy or replication at the time of ingestion, removing the ongoing need for backup.

Creating hybrid archives (using disk-based and tape-based technology) is the answer to your question, and using "Versioning" to store multiple iterations of a file over time is possible and included in many archive solutions. As you point out, versions can be stored without significantly consuming capacity. If you know your data, you can move it (perhaps automatically) to the correct archive technology for long-term preservation.

I agree that metrics can always be improved. Currently, file metadata is about the only way you can decide which files should be moved to fast archive and which to slower, but for many organizations that is enough.