News
6/19/2012
11:45 AM
George Crump
Commentary

Deduplication Performance: More Than Processing Power

Storage performance problems can't be solved by just throwing more processing power at them.

Deduplication, the process of identifying redundant data segments between separate files, is making its way through the storage infrastructure. The next deduplication targets are primary storage systems and all-flash arrays. As a recent Storage Switzerland poll showed, the number one concern is performance impact. You want deduplication, but you don't want it to slow down your applications.

Overcoming performance problems takes more than throwing more processing power at them. You need intelligent design of the deduplication logic itself. More processing power certainly helps. As most storage systems upgrade to the latest Intel processors, they will get a new lease on life when it comes to providing the storage services you expect, like snapshots, replication, and cloning.

Deduplication is a little different, though, because most deduplication technologies require a database-like lookup. Anytime there is a lookup, the processor's speed becomes less important, because the processor has to wait on the device that holds the lookup table. That device's speed becomes more important than processing power.
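A rough way to see why the lookup device matters more than the processor is to add up the time spent on each segment. The sketch below assumes each segment is hashed and then looked up serially, and every timing in it is a hypothetical round number chosen for illustration, not a measurement of any product.

```python
NS_PER_SECOND = 1_000_000_000


def segments_per_second(hash_ns: int, lookup_ns: int) -> int:
    """Segments processed per second when hashing and lookup happen one after the other."""
    return NS_PER_SECOND // (hash_ns + lookup_ns)


# Hypothetical timings (assumptions, not benchmarks):
HASH_NS = 2_000              # processor time to hash one segment
FAST_HASH_NS = 1_000         # the same work on a processor twice as fast
SLOW_LOOKUP_NS = 100_000     # table lookup served by a slower device
RAM_LOOKUP_NS = 100          # table lookup served from RAM

baseline = segments_per_second(HASH_NS, SLOW_LOOKUP_NS)
faster_cpu = segments_per_second(FAST_HASH_NS, SLOW_LOOKUP_NS)
faster_table = segments_per_second(HASH_NS, RAM_LOOKUP_NS)

print(baseline, faster_cpu, faster_table)
```

Under these assumed timings, doubling processor speed raises throughput by roughly one percent, while moving the table to the faster device raises it many times over.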


What causes the deduplication lookup? Most deduplication technologies create a table to store information about unique and similar data. As data is sent to a device with deduplication enabled, that data is segmented and each segment is given a hash code. Think of it as a unique serial number. The segment is then stored, and its hash code is recorded in the table. As more data is sent to the device, it is also segmented and hashed. If a new code matches one already in the table, the segment is identical to data already stored. The table is updated, but the redundant data is not stored again. This is where your capacity savings come from. It is also where your potential performance bottleneck comes from.
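The segment, hash, and lookup steps described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the `DedupStore` class, the fixed 4-byte segment size, the reference counts, and the choice of SHA-256 are all assumptions made to keep the example small.

```python
import hashlib

SEGMENT_SIZE = 4  # bytes; unrealistically small so the example is easy to follow


class DedupStore:
    """Toy fixed-size-segment deduplicating store (hypothetical, for illustration)."""

    def __init__(self):
        self.table = {}     # hash code -> reference count (the lookup table)
        self.segments = {}  # hash code -> the one stored copy of that segment

    def write(self, data: bytes) -> list:
        """Segment the data, store only unseen segments, and return the hash codes."""
        codes = []
        for start in range(0, len(data), SEGMENT_SIZE):
            segment = data[start:start + SEGMENT_SIZE]
            code = hashlib.sha256(segment).hexdigest()
            if code in self.table:
                # Match: update the table, but do not store the data again.
                self.table[code] += 1
            else:
                # New code: record it in the table and store the segment.
                self.table[code] = 1
                self.segments[code] = segment
            codes.append(code)
        return codes

    def read(self, codes) -> bytes:
        """Rebuild the original data from its list of hash codes."""
        return b"".join(self.segments[code] for code in codes)


store = DedupStore()
recipe = store.write(b"AAAABBBBAAAAAAAA")
logical_bytes = len(recipe) * SEGMENT_SIZE
physical_bytes = sum(len(segment) for segment in store.segments.values())
print(logical_bytes, physical_bytes)
```

Every segment written costs one lookup in `table`, whether or not it turns out to be a duplicate, which is why the speed of that lookup sets the pace.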

As more and more data is stored on the system, that table grows in size. The more unique the data is, the more that table will grow, and, of course, the more total data there is, the more that table will grow. This is important when we talk about deduplication in primary storage. Primary storage has nowhere near the level of redundancy that backup data does. As a result, the number of entries in the hash table can be dramatically larger than it would be in a backup environment.
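A back-of-the-envelope estimate shows how uniqueness drives table size. The sketch assumes one table entry per unique segment; the capacity, segment size, entry size, and uniqueness percentages are all hypothetical inputs picked for illustration, not figures from the poll or from any product.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def table_entries(total_bytes: int, segment_bytes: int, unique_percent: int) -> int:
    """Approximate entry count: one table entry per unique segment."""
    total_segments = total_bytes // segment_bytes
    return total_segments * unique_percent // 100


def table_bytes(entries: int, entry_bytes: int) -> int:
    """Approximate table footprint for a given per-entry size."""
    return entries * entry_bytes


# Hypothetical inputs (assumptions, not measurements):
CAPACITY = 100 * TIB     # data under management
SEGMENT = 8 * 1024       # segment size in bytes
ENTRY = 32               # bytes per table entry (hash code plus bookkeeping)

backup_entries = table_entries(CAPACITY, SEGMENT, unique_percent=10)
primary_entries = table_entries(CAPACITY, SEGMENT, unique_percent=70)

print(table_bytes(backup_entries, ENTRY) // GIB, "GiB for the backup-like dataset")
print(table_bytes(primary_entries, ENTRY) // GIB, "GiB for the primary-like dataset")
```

With the same capacity and segment size, the table scales directly with the share of unique data, so the less redundant dataset needs a proportionally larger table.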

Why is this important? It impacts the type of storage device where you can store the lookup table. Most vendors, aware that the speed of the device holding the deduplication table is an issue, try to store the entire table in RAM. There is no waiting, but there is a cost and capacity issue.

What happens when the number of entries, or the total dataset under management, becomes so large that the deduplication lookup table gets too big to store in RAM? You have to page to a hard disk or SSD, so, essentially, RAM becomes a cache. The problem is that normal cache logic won't work in this situation. This is not a first-in, first-out scenario. You need to verify uniqueness across the entire dataset, not just the most recently stored data.
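The sketch below illustrates why ordinary cache logic falls short once the table pages out of RAM. It assumes a simple least-recently-used cache in front of a complete table on a slower device; the `TieredIndex` class and its counters are hypothetical and exist only to count how often the slow device is consulted.

```python
from collections import OrderedDict


class TieredIndex:
    """Complete hash table on a slow device, with a small recency cache in RAM."""

    def __init__(self, ram_entries: int):
        self.ram = OrderedDict()   # only the most recently touched hash codes
        self.ram_entries = ram_entries
        self.slow = set()          # stands in for the full table on disk or SSD
        self.slow_lookups = 0      # number of times the processor had to wait

    def _remember(self, code: str) -> None:
        self.ram[code] = True
        self.ram.move_to_end(code)
        if len(self.ram) > self.ram_entries:
            self.ram.popitem(last=False)  # evict the least recently used code

    def is_duplicate(self, code: str) -> bool:
        if code in self.ram:
            self._remember(code)
            return True  # a RAM hit is enough to prove the segment is a duplicate
        # A RAM miss proves nothing: the code may simply have been evicted,
        # so the complete table must be checked before calling the data unique.
        self.slow_lookups += 1
        found = code in self.slow
        self.slow.add(code)
        self._remember(code)
        return found


index = TieredIndex(ram_entries=2)
results = [index.is_duplicate(code) for code in ["a", "b", "c", "a", "a"]]
print(results, index.slow_lookups)
```

In this run the second "a" arrives after its entry has been evicted from RAM, so it still costs a trip to the slow device. Every genuinely unique segment costs one as well, because a cache miss alone cannot establish uniqueness.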

As we will cover in our upcoming webinar, "What is Breaking Deduplication?", if RAM is not used properly, the result is repeated trips to the hard disk or SSD, and the powerful processor in the storage system has to wait. Its extra processing power, and the money you spent on it, goes to waste.


George Crump is lead analyst of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments. Storage Switzerland's disclosure statement.
