Deduplication



Overview

Deduplication Architecture

N-Instancing (Number of Instances)

License Requirements

How to Set up Deduplication

Configure Deduplication Options

How does Deduplication Work

Best Practices for Deduplication

Other Considerations

Disabling Deduplication


Overview

Deduplication eliminates the storage of redundant data in backup storage. A single copy of the redundant data is stored, and any subsequent references to the same data are stored as pointers to the previously stored copy. For example, consider the typical scenario of a mail system backup where the same 5 MB sales presentation is present in, say, 30 inboxes. Instead of storing 30 copies of the 5 MB presentation, a deduplication-enabled backup stores a single copy of the attachment and records pointers to the saved copy for any subsequent backup of the same presentation.

The efficient use of storage space enables storage of large volumes of backup data, reducing the cost of backup storage, which in turn facilitates longer retention periods.

The software supports deduplication of data stored in Magnetic as well as Tape Libraries.

Deduplication is supported by both backup and data archival products in order to provide optimization in storage when the data to be backed up contains redundant data. For complete details on deduplication support, see Deduplication - Support.

Block Level Deduplication

Deduplication at the data block level compares blocks of data against each other. Block level deduplication allows you to deduplicate data within a given object. If an object (file, database, etc.) contains blocks of data that are identical to each other, then block level deduplication eliminates storing the redundant data and reduces the size of the object in storage.

Consider a backup containing data from an Exchange Server or SQL Server database. Block level deduplication divides the data into individual data blocks and then compares the blocks against each other. If a data block is unique, the block is stored on the media. If a data block is identical to a previously stored block, a pointer to the previously stored block is stored instead.
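The comparison described above can be sketched as follows. This is an illustrative Python sketch, not the product's implementation; the fixed block size, the in-memory dictionary, and the function names are assumptions:

```python
import hashlib

def dedup_blocks(data: bytes, block_size: int = 128 * 1024):
    """Divide data into fixed-size blocks; store each unique block once
    and record duplicate blocks as pointers (signatures) to the stored copy."""
    store = {}    # signature -> unique block written to the media
    layout = []   # ordered signatures needed to reassemble the object
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        sig = hashlib.sha512(block).hexdigest()
        if sig not in store:
            store[sig] = block   # unique block: store it on the media
        layout.append(sig)       # duplicate block: only a pointer is kept
    return store, layout

def reassemble(store, layout):
    """Rebuild the original object by following the pointers."""
    return b"".join(store[sig] for sig in layout)
```

Highly repetitive data shrinks to a handful of unique blocks plus a list of pointers, which is the source of the storage savings described above.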

Identical data blocks within an object, as well as across objects within the same storage policy copy, are deduplicated.

The block level deduplication is illustrated in the graphic.

Object Level Deduplication

Deduplication at the object level compares entire objects against each other, for example, files, images, or databases. If an object is identical to a previously stored object, the object is deduplicated.

Object level deduplication is only available with the Data De-Duplication Enabler license from the previous release.
Consider the following scenario:

Assuming full backups are run on multiple client computers with the same operating system and applications, copies of the install folders containing the system files and dynamic-link library (DLL) files will be added to storage.

Deduplication substantially reduces the storage required for these backups when the clients are grouped and associated with the same deduplication-enabled storage policy. The repetitive data is stored once, and any subsequent copy of the same data points to the stored copy.

The object level deduplication is illustrated in the graphic.

In the above examples, the numbers are for illustration only. In practice, additional space requirements (overheads) apply for storing metadata such as file Access Control Lists.

Deduplication Architecture

Deduplication uses a hashing algorithm to compare data. A Signature Generation Module computes a hashed signature for each object/block and compares it with the existing signatures maintained in the Deduplication Store to determine whether the object/block is identical to previously stored data. Based on the comparison, the MediaAgent performs one of the following operations:

  • If the signature is unique, the object/block is stored on the media and the signature is added to the Deduplication Store.
  • If the signature already exists in the store, only a pointer to the previously stored object/block is written.

Deduplicated data is stored in specially designed container files to increase system throughput and scalability.

Note that there are appliances that support single instancing of data (e.g., Centera). Such hardware can be configured as a magnetic library with the option to write the data in a single-instance enabled format. See Enable Support for Single Instancing of Data (Content Address Storage) for more information.

Signature Generation Module

The signature generation module uses SHA-512 (Secure Hash Algorithm) along with the size of the data to generate unique signatures for objects/blocks. This combination virtually eliminates the possibility of collisions, where two different objects/blocks hash to the same value.
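As a sketch, the signature described above pairs the SHA-512 digest with the data size (illustrative Python; the function name is an assumption):

```python
import hashlib

def block_signature(block: bytes):
    """Signature = SHA-512 digest of the data combined with its size,
    as described above, so that two different objects/blocks are
    vanishingly unlikely to produce the same signature."""
    return (hashlib.sha512(block).digest(), len(block))
```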

The signature generation module can be configured on either the Client or the MediaAgent. Signature generation is both memory and resource intensive, so running it on the Client is recommended. See Configure Deduplication for a Subclient for instructions to configure signature generation.

Deduplication Store

The Deduplication Store, or Deduplication Database, is the repository for the signatures of all objects/blocks backed up using the storage policy copy, along with reference counts to the stored copies of those objects/blocks. A Deduplication Store is maintained for each Storage Policy Copy that has the deduplication option enabled. Multiple MediaAgents can be part of the same copy and use the same Deduplication Store, provided the libraries accessed by the MediaAgents are configured as static shared libraries and are accessible from all the MediaAgents.
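Conceptually, the store maps each signature to a reference count that later governs pruning. A minimal sketch (illustrative Python; the real store is an on-disk database, not a dictionary, and the names are assumptions):

```python
def record_reference(refs: dict, sig) -> bool:
    """Add one reference to a signature in the (simulated) store.
    Returns True when the signature is new, meaning the object/block
    must actually be written to the media."""
    refs[sig] = refs.get(sig, 0) + 1
    return refs[sig] == 1
```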

The deduplication store access path is configured when creating a storage policy copy, for both primary and secondary storage policy copies. The MediaAgent associated with the store can be any of the MediaAgents in the data paths, or a MediaAgent outside the data paths. Once the store is created, you can change the MediaAgent hosting it.
  • Note that the deduplication database must be located in a folder and not directly under the root of a disk volume.
  • Do not manually delete the Deduplication Store. The Deduplication Store facilitates the deduplication backup jobs and data aging jobs. If deleted, new deduplicated backup jobs cannot be performed and the existing data in the disk mount paths will never be pruned.

You can use the deduplication tool to evaluate a disk for hosting the deduplication store. The tool simulates the deduplication store operations on the disk and provides recommendations on the store size, average access time, and disk performance statistics. See Evaluate a Disk for Hosting the Deduplication Store for instructions.

Disk Specifications for Hosting the Deduplication Database

To ensure optimal performance for deduplication operations, the disk hosting the deduplication store must satisfy the following specifications. Note that these specifications are only for the disk hosting the deduplication store, and not for the entire mount path.

Configure Deduplication Store Creation

By default, a new deduplication store is created for every 100 TB of data. Note that this is the amount of data stored on the media after deduplication. Depending upon your configuration and requirements, deduplication store creation can be configured based on the following parameters:

See Configure Deduplication Store Creation for step-by-step instructions.

Seal Deduplication Store On-Demand

The currently active deduplication store can be sealed on demand. When a deduplication store is sealed, no new data is deduplicated against the existing data in the store; the current store is closed. A new store is automatically created, and deduplication for new backup jobs is recorded in the new store. This option is useful in rare cases of hardware issues, such as disk corruption, where creating a new store prevents new data from referencing old data on the corrupted disks.

See Seal the Active Deduplication Store for instructions.

Manage the Deduplication Store

You can use the SIDB tool to administer the deduplication store and collect diagnostics information. The SIDB tool can be used for the following purposes:


N-Instancing (Number of Instances)

N-Instancing is the capability to specify the number of copies (instances) of each deduplicated object/block to be created in the deduplication storage. N-Instancing increases the integrity of data protection operations in the deduplication mode. The backup operation stores the first instance of the object/block; when the same object/block is encountered in subsequent backups, additional instances are stored until the configured number is reached. During restore operations, if one instance of an object/block is unreadable, the data is read from another available instance. The instances are referenced in a round-robin fashion.
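The round-robin fallback on restore can be sketched as follows (illustrative Python; unreadable instances are modeled as None, and the function name is an assumption):

```python
def read_instance(instances, counter: int) -> bytes:
    """Pick a starting instance in round-robin fashion; if that copy is
    unreadable, fall back to the next available instance."""
    n = len(instances)
    for k in range(n):
        copy = instances[(counter + k) % n]
        if copy is not None:   # instance is readable
            return copy
    raise IOError("all instances of the object/block are unreadable")
```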

N-Instancing can be set in the Redundancy Factor parameter. See Configure Deduplication Options for Storage Policy Copies for step-by-step instructions to configure the number of instances stored.


License Requirements

This feature requires a Feature License to be available in the CommServe® Server.

Review general license requirements included in License Administration. Also, View All Licenses provides step-by-step instructions on how to view the license information.

The following license is required for this feature:

Block Level De-Duplication license for using deduplication. One license is required for each MediaAgent hosting the Deduplication Store.

If you have upgraded the MediaAgent, the Data De-Duplication Enabler license from the previous release can be used for object level deduplication. Deduplication jobs from the previous release continue as object level deduplication in this release.

How to Set up Deduplication

The deduplication option can be enabled (or disabled) only during copy creation. Once a copy is created, deduplication cannot be enabled or disabled on it.

Review the following before enabling deduplication.

Use the following steps to implement deduplication on backup storage:

  1. Create a new Storage Policy, and perform the following when creating the storage policy.

    See Enable Deduplication for a Primary Copy for step-by-step instructions.

    If you select block level deduplication, set the block size for deduplication in the storage policy properties. See Configure Block Size for Block Level Deduplication for step-by-step instructions.
  2. To enable deduplication in secondary copies, create a secondary copy and enable deduplication upon creation. See Enable Deduplication for a Secondary Copy to enable deduplication for secondary copies. If you select block level deduplication, set the block size for deduplication in the storage policy properties. See Configure Block Size for Block Level Deduplication for step-by-step instructions.
  3. Configure deduplication options for storage policy copies. See Configure Deduplication Options for Storage Policy Copies for step-by-step instructions.
  4. Make sure that all the clients that you wish to deduplicate point to this storage policy copy. See Associate a Subclient to a Storage Policy for step-by-step instructions.
  5. Enable/disable and configure signature generation on all the subclients associated with the storage policy copy. See Configure Deduplication for a Subclient for step-by-step instructions.

    If necessary, deduplication can be enabled and configured using a subclient policy as a template and then be associated with the appropriate subclients. See Subclient Policies for more information.

  6. Existing non-deduplicated backup data can also be deduplicated. For existing data, create a storage policy copy with deduplication enabled, create a secondary copy, and run an Auxiliary Copy operation to the secondary copy. If necessary, you can promote the secondary copy to be the primary copy so that subsequent backups are automatically deduplicated.

    Note that while object level deduplication is supported for older data, block level deduplication is only supported for backups performed using the current version of the software.


Configure Deduplication Options

The following deduplication options can be configured:

  1. The number of instances of deduplication objects/blocks created. See N-Instancing for more information. See Configure Deduplication Options for Storage Policy Copies for step-by-step instructions.
  2. The minimum free space that must be available at all times in the volume in which the Deduplication Store is configured. If this space is not maintained, deduplication jobs will fail. See Configure Deduplication Options for Storage Policy Copies for step-by-step instructions.
  3. The free space threshold for the volume in which the deduplication store is configured; when free space drops to this level, a warning is generated, if configured. See Free Space Warning for Deduplication Store Using Alerts for more information. See Configure Deduplication Options for Storage Policy Copies for step-by-step instructions.
  4. For object level deduplication, the minimum size of object to be picked up for deduplication. See Minimum Size of Deduplicable Objects for more information. See Configure Deduplication Options for Storage Policy Copies for step-by-step instructions.
  5. The number of days after which an object/block cannot be used for new deduplication references. See Age of the Primary Object/Block for more information. See Configure Deduplication Options for Storage Policy Copies for step-by-step instructions.
  6. Enable compression for all the subclients associated with the storage policy copy. Note that this option enables data compression on subclients even if compression is not enabled at the corresponding subclient level. See Configure Deduplication Options for Storage Policy Copies for step-by-step instructions.
  7. You can change the MediaAgent hosting the deduplication store. See Change the MediaAgent Hosting Deduplication Store for step-by-step instructions.
  8. You can change the location of the deduplication store. See Change the Location of Deduplication Store for step-by-step instructions.
  9. Configure the rules for creating deduplication stores, and seal the active deduplication store. See Configure Deduplication Store Creation, and Seal the Active Deduplication Store for step-by-step instructions.

How does Deduplication Work

The following sections describe the various operations in the deduplication mode.

Data Protection Operations

When deduplication is enabled, the sequence of operations is almost identical to that of a regular data protection job.

When a data protection job is initiated, the backup module secures the data and starts the data transfer module to the MediaAgent. As the data is secured, it is compressed (if data compression is enabled on the client), hashed by the signature generation module (if deduplication signature generation is configured on the client), and then encrypted (if client encryption is enabled).

Data Multiplexing is not supported with Deduplication.

Data Recovery Operations

Data recovery operations are identical to regular restore operations and are virtually unaffected by deduplication. The deduplication store is not contacted for normal restore operations, except when the data is not available on the disk and more than one instance is configured using N-Instancing.

All types of restore operations (including Restore by Jobs and Restoring from copies) are supported.

Auxiliary Copy

Auxiliary Copy operations automatically unravel (explode) deduplicated data from the source copy. If the secondary copy is set up for deduplication, a deduplication store is created for the copy and the associated data is deduplicated again for the secondary copy. The Auxiliary Copy follows the deduplication type (block level or object level) of the source copy.

  • Data in a storage policy copy enabled for Deduplication cannot be multiplexed. Therefore, Data Multiplexing is not supported if the storage policy copy is enabled with Deduplication. However, a SILO copy supports Data Multiplexing even if the storage policy copy is enabled with Deduplication.
  • Multiplexed data cannot be copied to a storage policy copy enabled for Deduplication. Therefore, a storage policy copy enabled for Deduplication cannot have a direct or indirect source copy enabled for Data Multiplexing.
  • An Auxiliary Copy can be configured with Data Multiplexing when the source copy is enabled for Deduplication.

Data Aging Operations

Data Aging operations automatically look up the deduplication store before data is deleted from the disk. Data Aging deletes the source data only when all references to a given object/block have been pruned. If older chunks in magnetic libraries remain on the volume even though the original data has been deleted, it may be because deduplication references to the chunk are still valid.
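The reference-count check described above can be sketched as follows (illustrative Python; refs is assumed to map each signature to its current reference count):

```python
def age_data(refs: dict, expired_sigs) -> list:
    """Decrement reference counts for aged references; a chunk is pruned
    from the disk only when its count drops to zero."""
    pruned = []
    for sig in expired_sigs:
        refs[sig] -= 1
        if refs[sig] == 0:
            del refs[sig]
            pruned.append(sig)   # now safe to delete from the mount path
    return pruned
```

This is why an aged job does not immediately free disk space: chunks that other jobs still reference keep a nonzero count and stay on the volume.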

Data Encryption and Data Compression

When Data Encryption and/or Data Compression is enabled, the system automatically runs the signature module after data compression and before data encryption. If the configuration contradicts this order, the system automatically performs compression, signature generation, and encryption on the source client computer. Pass-phrase protected data encryption is not supported.
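The ordering can be sketched as a client-side pipeline (illustrative Python; XOR stands in for a real cipher and zlib for the product's compression, both assumptions):

```python
import hashlib
import zlib

def client_pipeline(data: bytes, key: bytes):
    """Enforced order: compress first, compute the signature on the
    compressed data, then encrypt. XOR 'encryption' is a placeholder
    for a real cipher."""
    compressed = zlib.compress(data)
    signature = hashlib.sha512(compressed).digest()  # after compression
    encrypted = bytes(b ^ key[i % len(key)]          # before transfer
                      for i, b in enumerate(compressed))
    return signature, encrypted
```

Signing the compressed (but not yet encrypted) data matters: encryption randomizes the bytes, so signatures computed after encryption would never match across jobs and deduplication would find nothing to share.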


Best Practices for Deduplication

Review the following best practices before using deduplication.

Space Requirements for Deduplication Store

The following calculations can be used to approximately determine the amount of space required for the deduplication store partition:

For example, if a client has 100 files to be backed up, with full backups scheduled every day for 30 days, the size of the deduplication store should be (0.2 * 100 * 30) KB = 600 KB. Note that this is the maximum space required; the deduplication store could be much smaller if a good level of deduplication is possible. In the above example, if the 100 files do not change at all, the space required will be (0.1 * 100 * 30) KB = 300 KB.
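The arithmetic above, as a quick check (the 0.2 KB and 0.1 KB per-entry figures are taken from the example, not universal constants):

```python
files, backups = 100, 30
kb_per_entry_worst = 0.2   # no deduplication between backups
kb_per_entry_best = 0.1    # files never change, full deduplication

worst_case_kb = kb_per_entry_worst * files * backups   # ~600 KB
best_case_kb = kb_per_entry_best * files * backups     # ~300 KB
```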

Minimum Size of Deduplicable Objects

For object level deduplication, the minimum size of the data object for which deduplication will be attempted can be defined in the Minimum Size of Deduplicable Object parameter. A typical backup stream consists of the actual data and the metadata (such as ACLs) associated with the object/block. It is recommended that the minimum object/block size for deduplication be greater than 50 KB; reducing it to a smaller size may not result in significant space savings. See Configure Deduplication Options for Storage Policy Copies for instructions to change the minimum size of deduplicable objects.

Age of the Primary Object/Block

The period of time for which deduplicated data is used as the primary reference for new secondary objects can be defined in the Do not Deduplicate against objects older than parameter. This ensures that very old objects/blocks are not used as the 'origin' data for newer deduplicated data protection jobs. See Configure Deduplication Options for Storage Policy Copies for step-by-step instructions to set the age of the primary object/block.

  • To obtain optimal results, we recommend that the values for Minimum Size of Deduplicable Object and Do not Deduplicate against objects older than not be set below their default values.

Free Space Warning for Deduplication Store Using Alerts

If the amount of free space on the volume hosting the deduplication store falls below the specified threshold, the MediaAgent generates an event message and, if configured, the MediaAgents (Disk Space Low) alert. If the free space falls below 10%, the job will not continue.

See Available Alerts and Alert Descriptions and Space Check Thresholds for the Software Installation and System Directories for detailed information on setting up the alert.

If the free space on the volume in which Deduplication Store is configured goes below 10% of the total space in the volume, deduplicated jobs will fail.
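As a sketch, the threshold check amounts to the following (illustrative Python; the 10% figure is from the text, the function name is an assumption):

```python
def dedup_volume_ok(free_bytes: int, total_bytes: int) -> bool:
    """Deduplicated jobs fail when the volume hosting the
    Deduplication Store drops to 10% free space or less."""
    return free_bytes / total_bytes > 0.10
```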

Clustered File System Storage

Magnetic libraries are supported on clustered file systems such as Global File Systems (GFS), Cluster File Systems (CFS), IBM General Parallel File System (GPFS), and Polyserve File System.

Deduplication configurations that require multiple MediaAgents must use clustered file system storage for magnetic libraries. Clustered file system storage allows configuring multiple MediaAgents to access and share the same data on disk mount paths. This enables all MediaAgents of a storage policy copy using the same deduplication store to have access to data written by other MediaAgents as well.

Rebooting a MediaAgent

You might reboot a MediaAgent to install updates or for maintenance purposes. For MediaAgents hosting the deduplication database, ensure that all deduplication transactions in memory are completed before rebooting. Failure to follow the recommendations might result in sealing of the deduplication store, which increases the amount of storage space consumed in the primary disk library. See Rebooting a MediaAgent Hosting the Deduplication Store for step-by-step instructions.


Other Considerations

Audit Trail

Operations performed with this feature are recorded in the Audit Trail. See Audit Trail for more information.

Related Reports

Storage Policy Report provides necessary information on jobs that are deduplicated.

Upgrade Considerations

Consider the following impact on deduplication when upgrading the MediaAgent:

Backward Compatibility

If the CommServe is upgraded, but the MediaAgent and the clients are not upgraded, then deduplication will be available only at the Object Level.

CommServe Database Restore

It is recommended that the CommServe database not be restored to a different point in time when there are Storage Policies configured with Deduplication. If a CommServe database restore is necessary to recover the CommServe, then after performing the restore, the current Deduplication Store must be sealed so that subsequent backup jobs are executed against a new Deduplication Store. This prevents synchronization issues between the deduplication database and the CommServe database.

Deduplication Jobs on Migrated CommCell

After CommCell migration, the deduplication store operates in the read-only mode on the destination CommCell. The migrated (deduplication enabled) storage policies on the destination CommCell can be used to restore the deduplicated data migrated from the source CommCell and perform auxiliary copy jobs with the migrated data as the source. However, the migrated storage policies on the destination CommCell cannot be used to perform new deduplication backup jobs.

Spool Copy

Deduplication-enabled Storage Policy Copies cannot be configured as Spool Copies. Existing deduplicated Spool Copies will continue to exist until the Spool Copy retention setting is removed. Once removed, the deduplicated copy cannot be configured as a Spool Copy again.

Installing Software Updates

When installing updates or patches in a deduplication-enabled setup, ensure all deduplication-enabled jobs are either suspended or stopped prior to installing the updates or patches. This will prevent accidental sealing of deduplication stores due to services being stopped when data protection operations are in progress.

Retire a Mount Path

When a mount path is no longer required, you can disable the mount path and reuse the disk space. To do this, first disable write operations on the mount path. This prevents the mount path from being used for subsequent write operations. The existing data in the mount path will be pruned based on the data aging settings. Once all the data in the mount path is completely pruned, disable the mount path and reuse the disk space. Note that mount paths used in a deduplication configuration cannot be deleted until the corresponding storage policy is deleted.


Disabling Deduplication

Once enabled, deduplication cannot be disabled from a storage policy copy. However, you can use the following workaround.

Suspend Deduplication

Although deduplication cannot be disabled, it can be temporarily suspended. When suspended, the backup data is not deduplicated and the deduplication store is not accessed for signature verification. You can use this feature if you wish to temporarily detach the deduplication store to gain access to it, primarily for diagnostics and maintenance purposes. Once you resume deduplication, signature verification and data deduplication resume. See Suspend/Resume Deduplication for instructions.

