George Crump, President, Storage Switzerland

September 10, 2010

4 Min Read

Deduplication is the elimination of redundant data, and it is typically associated with optimizing storage utilization. I've spent some time lately defending our stance that deduplication in primary storage can be done without a performance penalty. What is not often discussed is that there is also the potential for a performance gain when using deduplication, one that may outweigh the resource costs associated with the process.

First, from a performance penalty perspective: by "no penalty" I mean that, configured correctly, the right combination of hardware and software should not negatively impact the user or application experience. As I discuss in a recent article, this may mean the vendor has to up the ante on the storage processor (although most of the time the storage processor is idle), and it definitely means that the software code has to be written very efficiently.

Dedupe can potentially boost both read and write performance in a storage system as well. The read side is somewhat easy to figure out, as long as deduplication is out of the read path and data does not need to be re-hydrated on read. This is typically done by leveraging the storage system's existing extent strategy, similar to how snapshots work today. If, as a result of deduplication, data has been consolidated into fewer bits of information, the storage system has less work to do. There is less head movement because there is less data to seek. Deduplication should also increase the likelihood of a cache hit, since, in a logical sense, more data can fit into the same cache. This can be especially true in a virtualized server environment, where much of each server OS image is deduplicated. The active files from the core virtual machine image are leveraged across multiple virtual servers, all reading the same or similar information, which, thanks to dedupe, is now really only one cache-friendly instance.
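
To make the cache argument concrete, here is a minimal sketch in Python (an illustration under assumed names and behavior, not how any particular array implements its cache). The read cache is keyed by fingerprint rather than by logical address, so once one virtual machine pulls a shared OS block into cache, every other logical copy of that block becomes a hit.

```python
import hashlib

class DedupeReadCache:
    """Toy fingerprint-keyed read cache (illustrative only)."""

    def __init__(self):
        self.logical_map = {}   # logical block address -> fingerprint
        self.cache = {}         # fingerprint -> block data
        self.hits = 0
        self.misses = 0

    def write(self, lba, data):
        # Deduplicated write path: many logical addresses can map to one fingerprint.
        self.logical_map[lba] = hashlib.sha256(data).hexdigest()

    def read(self, lba, read_from_disk):
        fp = self.logical_map[lba]
        if fp in self.cache:
            # One cached copy serves every logical address sharing this fingerprint.
            self.hits += 1
            return self.cache[fp]
        self.misses += 1
        self.cache[fp] = read_from_disk(fp)
        return self.cache[fp]

# Ten virtual machines booting from largely the same OS image: after the first
# VM reads a shared block, the other nine are served from cache.
cache = DedupeReadCache()
os_block = b"common OS image block"
for vm in range(10):
    cache.write(("vm%d" % vm, 0), os_block)
for vm in range(10):
    cache.read(("vm%d" % vm, 0), lambda fp: os_block)
print(cache.hits, cache.misses)   # 9 1
```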

Most deduplication vendors agree that reads are the easy part; writes are hard. To see a write boost from deduplication is, to the best of my reasoning, going to require an inline process. Post-process or even parallel-process deduplication means that the write, redundant or not, always goes to the disk system first and is only later (within seconds, in the parallel case) eliminated if a redundancy is found. While this can be done with minimal, if any, performance penalty, it would be hard to claim a write performance gain as a result. A write occurred that potentially did not need to, and if that write is found to be redundant, some work then has to be done to erase or release the redundant information.
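
To illustrate the difference, here is a hedged sketch of an inline write path in Python (illustrative only, not any vendor's implementation). The fingerprint lookup happens before the data is committed, so a block that already exists never generates a physical write at all; a post-process design would have performed that write and reclaimed it later.

```python
import hashlib

class InlineDedupeWriter:
    """Toy inline deduplication write path (illustrative assumptions only)."""

    def __init__(self, write_block_to_disk):
        self.index = {}          # fingerprint -> physical location
        self.refcounts = {}      # fingerprint -> number of logical references
        self.logical_map = {}    # logical block address -> fingerprint
        self.write_block_to_disk = write_block_to_disk
        self.physical_writes = 0

    def write(self, lba, data):
        fp = hashlib.sha256(data).hexdigest()
        self.logical_map[lba] = fp
        if fp in self.index:
            # Redundant block identified inline: record another reference and
            # return without touching disk -- no data write, no parity update.
            self.refcounts[fp] += 1
            return self.index[fp]
        # New data: the only case that generates a physical write.
        location = self.write_block_to_disk(data)
        self.physical_writes += 1
        self.index[fp] = location
        self.refcounts[fp] = 1
        return location
```

Under this model, writing the same block to 1,000 logical addresses results in one physical write; a post-process design performs all 1,000 and cleans up 999 of them afterward.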

With an inline deduplication model, a redundant write can be identified before it happens, so it never has to occur. If that write does not have to occur, then neither do the parity calculation and the subsequent parity writes. In a RAID 6 configuration, over the course of not having to save a file, or parts of a file, that already exists, you could be saving hundreds of block writes and parity writes.
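
To put rough numbers on that (every figure below is an illustrative assumption, not a measurement), consider the RAID 6 small-write path, where each update normally touches the data block plus both parity blocks and usually requires a read-modify-write as well:

```python
# Back-of-the-envelope savings from skipping redundant writes on RAID 6.
redundant_blocks_skipped = 100       # assumed: blocks found redundant inline
writes_per_small_update = 1 + 2      # data block + P parity + Q parity
reads_per_small_update = 1 + 2       # old data + old P + old Q (read-modify-write)

device_writes_avoided = redundant_blocks_skipped * writes_per_small_update
device_reads_avoided = redundant_blocks_skipped * reads_per_small_update
print(device_writes_avoided, device_reads_avoided)   # 300 300
```

Under those assumptions, skipping just 100 redundant block writes avoids roughly 300 device writes and 300 device reads.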

As I discussed in an SSD chat on Enterprise Efficiency, where the use of inline deduplication in primary storage could become very interesting is in FLASH-based systems. First, FLASH memory is more expensive per GB than mechanical hard drives, so every percent of capacity saved has roughly 10X more value. Second, FLASH's weakness is writes. While FLASH controllers are addressing many of the wear-leveling issues that surround FLASH, the fewer the writes, the longer the life of the solid state system. Finally, FLASH storage typically has performance to spare; using the excess to increase its value and reliability makes deduplication a good investment.

FLASH or HDD, using inline deduplication on primary storage has the potential to improve overall write performance in a storage system. The question is whether the amount of work that has to occur to determine redundancy negates the gains in write performance. That, as with other forms of deduplication, largely depends on how well the deduplication software is written, how efficient its lookups are, and how much memory and processing power can affordably be put on the storage system itself.
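
As a rough illustration of why those lookups and that memory matter (all numbers below are assumptions for the sake of the arithmetic, not vendor figures), consider how large a naive in-memory fingerprint index would be for a modest primary array:

```python
# Sizing a brute-force fingerprint index for a hypothetical 20 TB array.
usable_capacity_tb = 20                   # assumed array size
block_size_kb = 4                         # assumed dedupe granularity
bytes_per_index_entry = 32 + 8 + 8        # fingerprint + location + refcount (assumed)

blocks = usable_capacity_tb * 1024**3 // block_size_kb
index_size_gb = blocks * bytes_per_index_entry / 1024**3
print(round(index_size_gb, 1), "GB of index for", usable_capacity_tb, "TB at 4 KB blocks")
# -> 240.0 GB of index for 20 TB at 4 KB blocks
```

At that scale a flat in-memory index is impractical, which is why the efficiency of the lookup structure, and how much of it can affordably sit in memory on the storage system, largely determines whether the write savings survive.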

In either case, make no mistake: we are heading into an era where deduplication without a performance penalty is a reality. The software can be made efficient enough, and the hardware has enough power, to make it so.

Track us on Twitter: http://twitter.com/storageswiss

George Crump is lead analyst of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments. Find Storage Switzerland's disclosure statement here.

About the Author(s)

George Crump

President, Storage Switzerland

George Crump is president and founder of Storage Switzerland, an IT analyst firm focused on the storage and virtualization segments. With 25 years of experience designing storage solutions for datacenters across the US, he has seen the birth of such technologies as RAID, NAS, and SAN. Prior to founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection. George is responsible for the storage blog on InformationWeek's website and is a regular contributor to publications such as Byte and Switch, SearchStorage, eWeek, SearchServerVirtualization, and SearchDataBackup.
