Storage market analysts expect data de-duplication to drive interest in and adoption of disk-based backup solutions, including virtual tape libraries (VTLs), for the simple reason that it increases the value proposition of these technologies. This article delves deeper into the technology to explain its approach and its outlook.
Storage research firms believe data de-duplication will be one of this decade’s most important new data protection technologies. Why? Because data de-duplication has the ability to revolutionize data protection by making disk-based backup and remote backup and replication much more efficient than they are today. Research firms consistently find cost to be the number-one obstacle to disk-based backup adoption, and data de-duplication lowers associated disk costs by reducing back-end disk capacity requirements.
As with many new technologies, there is a lot of confusion in the market about data de-duplication. In fact, recently ESG Research revealed strong interest in and awareness of data de-duplication among organizations of varying sizes and industries. ESG believes such strong interest in data de-duplication this early in the adoption curve could be indicative of either underlying confusion in the market about what constitutes data de-duplication and what doesn’t or the compelling nature of data de-duplication, which sets it apart from other emerging technologies and enables it to break the rules of typical technology adoption curves. Data de-duplication is transparent. It doesn’t rely on a number of dependent variables for it to be widely adopted.
Understanding Data De-duplication
Research firm ESG defines data de-duplication as the process of eliminating or removing redundant files, bytes or blocks of data to ensure that only ‘unique’ data is stored on disk. Data de-duplication is also an example of what ESG refers to as a capacity optimized protection, or COP, technology. COP technologies are designed to reduce data protection-related capacity requirements. The potential benefits of data de-duplication are many, but the most notable advantage is that data de-duplication addresses the “capacity bloat” problem head-on by significantly reducing the amount of capacity required on the back-end.
The more granular the de-duplication process, the greater the capacity reduction. In general, data de-duplication that is done at the file level, while still effective, detects less duplicate data than de-duplication that is done at the block or byte level. Likewise, de-duplication that is done at the byte level is generally more efficient at detecting duplicate data than de-duplication that is done at the block level.
This difference in granularity is illustrated in the following example: An end-user creates a 1MB PowerPoint presentation and then sends it out as an e-mail attachment to 20 internal people for review. In a traditional backup environment -- that is, one without data de-duplication -- each attachment would be backed up at the end of the day during the nightly full backup even though no changes were made to the files, consuming unnecessary disk capacity (20 x 1MB = 20MB). Even in a small organization, the cost of this type of redundancy can be significant in terms of physical disk capacity, power and cooling for that disk, etc.
With file-level data de-duplication, however, only one copy of the PowerPoint file is saved. All other attachments (i.e., the duplicate or repeated copies) are replaced with “pointers.” This frees up disk capacity for other applications and allows users to extend retention periods, should they desire to.
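The file-level case can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the function name dedupe_files, the mailbox paths and the use of a SHA-256 hash to recognize identical files are all assumptions made for the example.

```python
import hashlib

def dedupe_files(files):
    """Store each distinct file body once and map every file name to a pointer."""
    store = {}     # content hash -> file bytes (the single stored copy)
    pointers = {}  # file name -> content hash (the "pointer")
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)  # keep only the first copy of each body
        pointers[name] = digest
    return store, pointers

# The article's example: one 1MB attachment delivered to 20 mailboxes.
# (Hypothetical file names; the payload is just filler bytes.)
attachment = b"x" * (1024 * 1024)
mailboxes = {f"user{i:02d}/review.ppt": attachment for i in range(20)}

store, pointers = dedupe_files(mailboxes)
logical_bytes = sum(len(data) for data in mailboxes.values())
stored_bytes = sum(len(data) for data in store.values())
print(logical_bytes // 2**20, "MB logical,", stored_bytes // 2**20, "MB stored")
```

All 20 names remain restorable through their pointers, but only one copy of the file body occupies disk.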
More granular de-duplication approaches, block- and byte-level, take the process one step further. They look at the pieces that make up those new 1MB files and compare them to elements that the de-duplication system has seen before, replacing repeated elements in the newer files with pointers rather than storing them again. (It should be noted that there are differences among vendors in how these processes are handled. Some performance differences between products may, in some cases, be related to the way that elements are compared and the way they are written to and managed on disk.)
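To see why finer granularity helps, consider a reviewer who edits a small part of the presentation and saves it. A file-level system sees a brand-new file, while a block-level system stores only the pieces it has not seen before. The sketch below assumes fixed-size 4KB blocks and SHA-256 fingerprints; both are illustrative choices, and real products differ in how they split and compare data.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration

def make_block(seed):
    """Build a deterministic, distinct 4KB block of sample data."""
    return hashlib.sha256(str(seed).encode()).digest() * (BLOCK_SIZE // 32)

def store_blocks(data, block_store):
    """Split data into fixed-size blocks, store unseen blocks, return the pointer list."""
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).digest()
        block_store.setdefault(digest, block)  # a repeated block costs only a pointer
        recipe.append(digest)
    return recipe

def restore(recipe, block_store):
    """Rebuild the original bytes by following the pointers in order."""
    return b"".join(block_store[digest] for digest in recipe)

# A 1MB "presentation" made of 256 distinct blocks, then a copy with one block edited.
original = b"".join(make_block(i) for i in range(256))
edited = original[:10 * BLOCK_SIZE] + make_block("edit") + original[11 * BLOCK_SIZE:]

block_store = {}
recipe_v1 = store_blocks(original, block_store)
blocks_after_v1 = len(block_store)
recipe_v2 = store_blocks(edited, block_store)
new_blocks = len(block_store) - blocks_after_v1
print("blocks added by the edited copy:", new_blocks)
```

Under these assumptions the edited copy adds a single 4KB block rather than a second 1MB file. Fixed-size blocks are the simplest scheme to show; an edit that inserts or deletes bytes shifts every later block boundary, which is one reason vendors handle this step differently.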
Of course, there are other considerations besides the granularity of the de-duplication process that can affect de-duplication ratios. For example, the type of data an organization generates (some data is inherently a lot more prone to duplicates than others) and how frequently that data changes both affect de-duplication ratios. One other important note about data de-duplication before we move on: It is a feature or a technology; it is not a standalone product. Its first application is in the data protection and retention market.
Applying Data De-Duplication to Traditional Backup
Data de-duplication, when added to traditional backup approaches (e.g., full, incremental, and differential backups), can significantly reduce the amount of data that has to be backed up.
Let’s look at the following backup approaches more closely: full backups, incremental backups, differential backups, and what storage experts refer to as data de-duplicated backups.
• Full backups: Full backups are generally performed on some type of regular basis (e.g., nightly, weekly, etc.) and involve taking a complete copy, or image, of an organization’s data. Full backups do not distinguish between “changed” data and “unique” data. They make a copy of all data with every backup. However, restoring from a full backup is generally more streamlined and less time-consuming than restoring from some other backup approaches.
• Incremental backups: Unlike full backups, incremental backups copy only files that have changed since the last full or incremental backup. The main advantage of incremental backups is that they reduce the number of files that you are backing up daily (versus a full backup), which allows for shorter backup windows. However, the restore process can be significantly longer since the last full and all subsequent incremental images, or copies, must be restored.
• Differential backups: Differential backups back up “all” data modified since the last “full” backup. This differs from incremental backups which include only data modified since the previous full or incremental backup. Once a file changes, it is backed up daily until the next scheduled full backup.
So, clearly, the disadvantage with differential backups is that the size of the backup increases throughout the week as files are changed, becoming progressively larger until the next weekly full backup. However, on the recovery side, only the full backup image and most recent differential image need to be restored, potentially providing quicker restore than an incremental backup, depending on when the restore occurs.
• De-duplicated backups: By applying de-duplication to these three traditional backup approaches, users can significantly reduce the amount of non-unique data they back up. Full backups, incremental backups and differential backups do not scan for “uniqueness.” Again, the actual de-duplication rate depends on a number of variables but 10x to 20x-plus is typical.
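The difference between the three traditional approaches is easiest to see by counting what each one copies over a few days. The following sketch uses a made-up workload (1,000 files, a full backup on Sunday and a handful of file changes on each following day); the file counts and the function name files_copied are assumptions for illustration only.

```python
# Hypothetical workload: 1,000 files, a full backup on Sunday, then the
# listed file ids change on each of the following days.
TOTAL_FILES = 1000
daily_changes = [
    {1, 2, 3},   # Monday
    {3, 4},      # Tuesday
    {5},         # Wednesday
    {1, 6, 7},   # Thursday
]

def files_copied(daily_changes, total_files):
    """Count the files each strategy copies on every day after the last full backup."""
    rows = []
    changed_since_full = set()
    for changed_today in daily_changes:
        changed_since_full |= changed_today
        rows.append({
            "full": total_files,                      # copies everything, every time
            "incremental": len(changed_today),        # changes since the previous backup
            "differential": len(changed_since_full),  # changes since the last full backup
        })
    return rows

rows = files_copied(daily_changes, TOTAL_FILES)
for day, row in zip(["Mon", "Tue", "Wed", "Thu"], rows):
    print(day, row)

# Images that must be read to restore as of Thursday. A nightly full needs only
# the latest image; the other two start from the Sunday full backup.
restore_images = {
    "full": 1,
    "incremental": 1 + len(daily_changes),
    "differential": 2,
}
print(restore_images)
```

In this sample week the differential grows from three files to seven while the incrementals stay small, and a Thursday restore needs five images with incrementals but only two with differentials. None of the three checks whether the bytes it copies are already on the backup disk, which is the gap de-duplication fills.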
Benefits You Achieve
Data de-duplication has several significant and immediate benefits for users. First and foremost, it can significantly reduce backup capacity requirements, which, among other things, can translate into cost-savings in a number of ways. It frees up capacity for backup data and enables longer retention periods, improves RTOs and reliability and can make WAN-based remote backup and replication more efficient. Let’s take a look at each of these more closely:
• Reduced backup capacity requirements translate into cost-savings. The actual amount of capacity reduction varies from organization to organization depending on a number of variables, including the type of data that is being backed up, the change rate of the data and the frequency of the backup. Even so, the ability to reduce disk capacity requirements by ratios such as 20x has some powerful cost-savings benefits for users, including lower disk and, perhaps equally important, lower power and cooling costs. In the case of disk costs, just consider the ability to store 20TB of backup data on 1TB of disk. The cost-savings are significant. In the case of power and cooling, which is becoming an increasingly important consideration in today’s data protection environments, the ability to store more backup data on less disk (e.g., 20TB of backup data on 1TB of disk capacity) can reduce power and cooling requirements considerably.
• “Freed up” capacity means more room for other backup data and longer retention periods with less media management. Data de-duplication can reduce the amount of physical disk needed for backup. Users can use this “reclaimed” space for two purposes: 1) to bring other backup data onto disk and 2) to lengthen the retention periods of data that is backed up to disk. Bottom line: De-duplication allows users to leverage disk as a backup target for more data and, importantly, allows data to be kept on disk for longer periods of time. Doing so has potentially huge benefits for users. Think about it. What if you could recover data that is three to six months old, or even older, without ever having to go to tape? Without data de-duplication, doing so would not be economically prudent, but with data de-duplication, it is not only possible, it is cost-effective. Tape can then be reserved for long-term archival of data that is infrequently, if at all, accessed and for “doomsday” data recovery.
• Data de-duplication enables better RTOs and improves reliability. The more data users back up to disk, the better able they are to meet RTOs and, hence, data protection SLAs. Data de-duplication allows users to keep more data on disk and for longer periods, which enhances RTOs. The fact is recovery from disk is a lot faster than recovery from tape. As for reliability, again, data is retained longer on disk, which means users rely less on tape for recoveries.
• Enables and expands WAN-based remote replication options for backup data. Once again, the power of data de-duplication is in its ability to reduce the amount of data that is backed up. Because there is less physical data being trafficked over the WAN, data de-duplication lowers the “cost” and/or “bandwidth” barrier to entry for WAN-based remote replication for many organizations, making it possible for some to do WAN-based remote replication for the first time and for others to cast a “wider net” of data protection around their remote data (i.e., include remote data previously not protected).
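The arithmetic behind the capacity and WAN benefits is simple enough to check by hand or with a short script. The sketch below is only a back-of-the-envelope model: the 100 Mbps link speed is a hypothetical figure, units are decimal (1TB = 10^12 bytes), and it assumes replicated data shrinks by the same ratio as stored data, which may not hold in practice.

```python
def physical_tb(logical_tb, ratio):
    """Disk needed to hold logical_tb of backup data at a given de-duplication ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return logical_tb / ratio

def transfer_hours(data_tb, link_mbps):
    """Hours needed to move data_tb over a link of link_mbps megabits per second."""
    bits = data_tb * 1e12 * 8
    return bits / (link_mbps * 1e6) / 3600

LOGICAL_TB = 20   # backup data before de-duplication, as in the article's example
LINK_MBPS = 100   # hypothetical WAN link speed

for ratio in (1, 10, 20):
    disk = physical_tb(LOGICAL_TB, ratio)
    hours = transfer_hours(disk, LINK_MBPS)
    print(f"{ratio:>2}x: {disk:5.1f} TB on disk, about {hours:6.1f} hours to replicate")
```

At 20x, the article's 20TB of backup data fits on 1TB of disk, and the amount of data that has to cross the WAN shrinks in the same proportion under the stated assumption.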
Implementing Data De-Duplication
There are many ways to implement data de-duplication: it can be done in software or through an appliance. As for where the actual data de-duplication takes place, it can be done in-line or off-line:
• In-line: The de-duplicating is done at the host by the backup application or by an appliance sitting in the data path.
• Off-line, or post-process: The de-duplicating is done by the system or an appliance sitting outside the backup path after the backup job is complete.
Both approaches are very effective in eliminating duplicate data. But as with any technology, there is a trade-off. In this case, it is performance versus capacity. There is a performance impact to doing the de-duplication in the data path, and there is a capacity hit to doing the process off-line, since capacity has to be initially allocated to the backup process (this capacity is later released after the de-duplication process is complete).
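The capacity side of this trade-off can be put into rough numbers. The sketch below is a deliberately simplified model, not a description of any product: it assumes a post-process system needs staging space equal to the full backup until de-duplication finishes, and it ignores metadata overhead and the performance cost of in-line processing. The function name disk_usage_tb and the sample figures are hypothetical.

```python
def disk_usage_tb(backup_tb, ratio, mode):
    """Return (peak, steady_state) disk usage in TB for a single backup job."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    steady = backup_tb / ratio
    if mode == "in-line":
        # Duplicates are removed in the data path, so only unique data lands on disk.
        return steady, steady
    if mode == "post-process":
        # The whole backup lands first, then shrinks once de-duplication completes.
        return backup_tb, steady
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical job: a 10TB backup that de-duplicates at 20x.
for mode in ("in-line", "post-process"):
    peak, steady = disk_usage_tb(10, 20, mode)
    print(f"{mode:>12}: peak {peak:4.1f} TB, steady state {steady:4.1f} TB")
```

Both modes end at the same steady-state footprint in this model; the difference is the temporary capacity the post-process approach must set aside during the backup.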
Determining which approach is better for your environment requires a thorough capacity/performance tradeoff analysis. If performance is critical, then the off-line approach may be the better route to take. But if it isn’t — and you’re looking for optimal disk capacity savings (throughout the entire process) — in-line may be the better approach. Of course, in-line versus off-line is just one consideration when evaluating de-duplication. As mentioned previously, technologies also differ in terms of the degree, or level of granularity, of de-duplication they do. All are important considerations when evaluating the different technologies that are available.
That said, it is important to note that while each approach has pros and cons in terms of performance, capacity and cost, storage analysts believe the benefits of data de-duplication — in particular, the potential disk cost savings — are significant enough to warrant the technology’s adoption.
Conclusion
Data de-duplication significantly changes the economics of disk-based data protection and, as such, enables levels of efficiency above and beyond what’s possible today without it and eliminates problems that plague data centers today. Now companies can recover reliably and quickly, they can back up remote offices and they can minimize tape backups. For these reasons, data de-duplication is a very compelling technology.
-By: ‘InfoStore’ Bureau.