This article gives a general overview of NetApp deduplication, which relies on the NetApp file system, WAFL. WAFL has always kept track of whether each block is free or in use; with deduplication, it also keeps track of how many references each block has. In the current implementation, a single WAFL block can be referenced up to 256 times, across different files or within the same file. Files don't "know" that they are sharing their data—bookkeeping within WAFL takes care of the details invisibly.
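To make the bookkeeping idea concrete, here is a minimal sketch (not NetApp code; all names are hypothetical) of a block map that tracks a per-block reference count, with a block considered free at zero references and sharing capped at 256, as described above:

```python
MAX_REFS = 256  # per-block reference limit described in the article

class BlockMap:
    """Hypothetical sketch of WAFL-style per-block reference counting."""

    def __init__(self):
        self.refcounts = {}  # block id -> number of references

    def add_ref(self, block_id):
        """Point one more file (or offset within a file) at this block.

        Returns False if the block is already at the sharing cap, in
        which case a new physical copy would have to be written instead.
        """
        count = self.refcounts.get(block_id, 0)
        if count >= MAX_REFS:
            return False
        self.refcounts[block_id] = count + 1
        return True

    def drop_ref(self, block_id):
        """Remove one reference; the block is free when the count hits 0."""
        self.refcounts[block_id] -= 1
        if self.refcounts[block_id] == 0:
            del self.refcounts[block_id]

    def in_use(self, block_id):
        return self.refcounts.get(block_id, 0) > 0
```

The files referencing a shared block never see this structure; only the block map changes when data is shared or released.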
Two basic components:
The algorithms used to identify and eliminate duplicate data
- Identifying duplicate data. Most existing deduplication products operate at the block level: new blocks are compared against previously stored blocks to determine whether an identical block has already been stored. If it has, the "new" block is discarded in favor of a pointer to the stored block. How do you determine whether two blocks are identical? The most common method is to compute a "fingerprint" for each block, which is a hash of the data the block contains. If two blocks have the same fingerprint, they are usually assumed to be identical.
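As a quick illustration of fingerprinting, the sketch below hashes each 4 KB block to produce its fingerprint. The choice of SHA-256 here is an assumption for the example; the article does not say which hash algorithm NetApp's fingerprint uses.

```python
import hashlib

BLOCK_SIZE = 4096  # example block size; WAFL blocks are 4 KB

def fingerprint(block: bytes) -> bytes:
    """Compute a per-block fingerprint as a hash of the block's data.

    SHA-256 is an illustrative choice, not NetApp's actual algorithm.
    """
    return hashlib.sha256(block).digest()

a = b"x" * BLOCK_SIZE
b = b"x" * BLOCK_SIZE
c = b"y" * BLOCK_SIZE
assert fingerprint(a) == fingerprint(b)  # identical data -> same fingerprint
assert fingerprint(a) != fingerprint(c)  # different data -> different fingerprint
```

Comparing short fingerprints instead of full 4 KB blocks is what makes duplicate detection cheap enough to run at scale.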
The reliability of the underlying hardware and software
- To protect your backup data, deduplication technology must use appropriate algorithms to avoid discarding unique data blocks, and it must also provide the fundamental hardware and software reliability necessary to safely store deduplicated data for later recovery. Because NetApp deduplication is used for primary data stores as well as for backup data, NetApp takes extra care to protect data reliability. NetApp deduplication uses a combination of fingerprints plus byte-by-byte block comparisons so that unique data blocks are never erroneously deleted due to hash collisions.
- Deduplicated data is stored on NetApp storage systems using hardware and operating software that have been proven reliable and resilient through years of field deployment, so you can be confident that when it comes time to recover data, you'll get back the data you backed up. With fingerprint-only approaches, however, there is a small but nonzero chance that two nonidentical blocks will yield the same fingerprint or hash value. This is termed a "hash collision" and can result in a unique data block being accidentally deleted.
- As you might expect, reducing the probability of a hash collision requires a more complex algorithm, which typically consumes more CPU resources to compute the hash and produces a larger output value. So there's an obvious trade-off between reliability and speed. Longer hashes also consume more space for fingerprint storage.
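The storage side of that trade-off is easy to see: stronger hash functions emit longer digests, so every stored fingerprint costs more space. The algorithms below are generic examples from Python's standard library, not NetApp's actual fingerprint algorithm.

```python
import hashlib

block = b"\x00" * 4096  # one example 4 KB block

# Stronger (harder-to-collide) hashes produce longer digests,
# which means more bytes of fingerprint metadata per stored block.
for name in ("md5", "sha1", "sha256", "sha512"):
    digest = hashlib.new(name, block).digest()
    print(f"{name}: {len(digest)} bytes per fingerprint")
```

For a store with hundreds of millions of blocks, the difference between a 16-byte and a 64-byte fingerprint adds up to gigabytes of metadata.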
- When you are evaluating deduplication technologies, you need to find out how a vendor identifies duplicates and ask about the risk of hash collisions with the chosen algorithm. Many vendors will argue that the chance of a hash collision is lower than the probability of a disk failure or a disk drive error or tape error that corrupts a data block. I don’t know if that’s a truly comforting thought or not, but I believe that most of us would prefer to minimize as many risks as possible.
- Because NetApp supports deduplication for both primary and backup storage, we take a more aggressive approach to preventing hash collisions. We use a fingerprint algorithm like most everyone else, but we use it only to identify potential duplicates. When a potential duplicate is found, we do a byte-by-byte comparison of the two blocks to confirm that they are identical before discarding either one.
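The approach described above can be sketched as follows: the fingerprint only nominates candidate duplicates, and a byte-by-byte comparison makes the final call, so a hash collision can never cause a unique block to be lost. This is a minimal illustration, not NetApp's implementation; the names and the use of SHA-256 are assumptions.

```python
import hashlib

def store_block(store, block: bytes) -> bytes:
    """Store a block, sharing storage with an identical existing block.

    `store` maps fingerprint -> list of stored blocks with that
    fingerprint. Returns the block object actually referenced.
    """
    fp = hashlib.sha256(block).digest()
    # The fingerprint is only a filter: check candidates byte by byte.
    for existing in store.get(fp, []):
        if existing == block:          # byte-by-byte confirmation
            return existing            # true duplicate: share the stored copy
    # No byte-identical match (new data, or a hash collision):
    # keep this block rather than risk discarding unique data.
    store.setdefault(fp, []).append(block)
    return block
```

Note that even if two different blocks ever shared a fingerprint, both would be kept, because the byte comparison would fail and the second block would be appended as its own entry.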
- NetApp deduplicated data is expanded back to its full size as it is copied from the storage system to tape. Backup software such as Veritas NetBackup, for example, may then compress or deduplicate the data again using its own algorithms.
Related topics: the following links provide good general information on NetApp deduplication, WAFL, and RAID-DP:
- http://www.netapp.com/us/communities/tech-ontap/dedupe-0708.html
- http://blogs.netapp.com/extensible_netapp/wafl/
- http://blogs.netapp.com/dave/2008/12/is-wafl-a-files.html
Information about VMware ESX and NetApp storage deduplication enhancements:
- http://searchvmware.techtarget.com/tip/0,289483,sid179_gci1335560,00.html
- http://blogs.netapp.com/drdedupe/2009/06/netapp-dedupication-revisited.html
- http://www.netapp.com/us/products/platform-os/dedupe.html