Deduplication eliminates the storage of redundant data in backup storage. A single copy of the redundant data is stored, and any subsequent references to the same data are stored as pointers to the previously stored copy. For example, consider the typical scenario of a mail system backup where the same 5 MB sales presentation appears in, say, 30 inboxes. Instead of storing 30 copies of the 5 MB presentation, a deduplication-enabled backup stores only a single copy of the attachment and records pointers to that copy for every subsequent backup of the same presentation.
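The savings in this example can be worked out directly. The following short calculation is illustrative only; the sizes and counts come from the example above:

```python
# Illustrative only: 30 mailboxes each holding the same 5 MB attachment.
copies = 30
size_mb = 5

without_dedup = copies * size_mb   # every inbox stores its own copy
with_dedup = size_mb               # one stored copy; the rest are pointers
savings = without_dedup - with_dedup

print(without_dedup, with_dedup, savings)  # 150 5 145
```

In other words, 145 MB of the 150 MB that would otherwise be written is replaced by lightweight pointers.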
The efficient use of storage space enables storage of large volumes of backup data, reducing the cost of backup storage which in turn facilitates longer retention periods.
The software supports deduplication of data stored in both magnetic and tape libraries.
Deduplication is supported by both backup and data archival products, providing storage optimization when the data to be backed up contains redundant data. For complete details on deduplication support, see Deduplication - Support.
Deduplication at the data block level compares blocks of data against each other. Block level deduplication allows you to deduplicate data within a given object. If an object (file, database, etc.) contains blocks of data that are identical to each other, then block level deduplication eliminates storing the redundant data and reduces the size of the object in storage.
Deduplication at the object level compares two objects against each other, for example files, images, database, etc. If the entire object is found to be identical, then the object is deduplicated.
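The difference between the two granularities can be sketched in a few lines of Python. This is an illustration only, not the product's implementation; the 4-byte block size is deliberately tiny (real block sizes are configured in the storage policy and are far larger):

```python
import hashlib

def object_signature(data: bytes) -> str:
    """Object-level: one signature for the whole object."""
    return hashlib.sha512(data).hexdigest()

def block_signatures(data: bytes, block_size: int = 4) -> list:
    """Block-level: a signature per fixed-size block within the object."""
    return [hashlib.sha512(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

# A file whose second half repeats its first half.
obj = b"ABCDABCD"

sigs = block_signatures(obj)
unique_blocks = len(set(sigs))

# Object level sees one unique object and stores it whole; block level
# stores only 1 of the 2 identical 4-byte blocks, halving this object.
print(len(sigs), unique_blocks)  # 2 1
```

Block-level deduplication can therefore find redundancy inside a single object, which object-level deduplication cannot.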
Object level deduplication is only available with the Data De-Duplication Enabler license from the previous release.
In the above examples, the numbers are for illustration purposes only. In practice, additional space requirements (overheads) for storing metadata, such as file Access Control Lists, will apply.
Deduplication uses a hashing algorithm to compare data. A Signature Generation Module computes a hashed signature for each object/block and compares it with the existing signatures maintained in the Deduplication Store to determine whether the data is identical. Based on this comparison, the MediaAgent performs one of two operations: if the signature is new, the data is stored and the signature is added to the Deduplication Store; if the signature already exists, only a pointer to the previously stored copy is recorded.
Deduplicated data is stored in specially designed container files to increase system throughput and scalability.
Note that some appliances support single instancing of data (for example, Centera). Such hardware can be configured as a magnetic library with the option to write data in a single-instance enabled format. See Enable Support for Single Instancing of Data (Content Address Storage) for more information.
The signature generation module uses the SHA-512 (Secure Hash Algorithm) digest together with the size of the data to generate unique signatures for objects/blocks. This combination makes a collision, where two different objects/blocks produce the same signature, practically impossible.
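The signature and lookup flow described above can be sketched as follows. This is a toy model, not the product's code: the dictionary stands in for the Deduplication Store, and the payloads and return strings are illustrative:

```python
import hashlib

# Toy stand-in for the Deduplication Store: signature -> reference count.
dedup_store = {}

def signature(block: bytes) -> tuple:
    # SHA-512 digest combined with the data size, as described above.
    return (hashlib.sha512(block).hexdigest(), len(block))

def write_block(block: bytes) -> str:
    sig = signature(block)
    if sig in dedup_store:
        dedup_store[sig] += 1      # duplicate: record only a pointer
        return "pointer"
    dedup_store[sig] = 1           # new data: store block and signature
    return "data"

print(write_block(b"sales.ppt"))   # data
print(write_block(b"sales.ppt"))   # pointer
```

The second write of identical data stores nothing but a reference, which is exactly the space saving deduplication provides.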
The signature generation module can be configured on either the Client or the MediaAgent. Because signature generation is both memory and resource intensive, it is recommended to run it on the Client. See Configure Deduplication for a Subclient for instructions on configuring signature generation.
The Deduplication Store, or Deduplication Database, serves as the repository for the signatures of all objects/blocks backed up using the storage policy copy, along with reference counts for each. A Deduplication Store is maintained for each Storage Policy Copy that has the deduplication option enabled. Multiple MediaAgents can be part of the same copy and use the same Deduplication Store, provided the libraries accessed by the MediaAgents are configured as static shared libraries and are accessible from all the MediaAgents.
The deduplication store access path is configured when creating a storage policy copy, for both primary and secondary copies. The MediaAgent hosting the store can be any one of the MediaAgents in the data paths, or a MediaAgent outside the data paths. Once created, you can change the MediaAgent hosting the deduplication store.
You can use the deduplication tool to evaluate a disk for hosting the deduplication store. The tool simulates the deduplication store operations on the disk and provides recommendations on the store size, average access time, and disk performance statistics. See Evaluate a Disk for Hosting the Deduplication Store for instructions.
To ensure optimal performance for deduplication operations, the disk hosting the deduplication store must satisfy the following specifications. Note that these specifications are only for the disk hosting the deduplication store, and not for the entire mount path.
Disk performance can be measured using the CvDiskPerf tool. See Measure Disk Performance for step-by-step instructions, sample commands, and sample output. The following table provides sample disk performance calculations:
Disk Performance | Write Throughput (GB/Hour) | Read Throughput (GB/Hour)
Sample 1 | 341.3798 | 477.6198
Sample 2 | 344.3546 | 513.2807
Sample 3 | 340.8644 | 575.6513
Sample 4 | 428.8675 | 499.7836
Sample 5 | 397.6285 | 426.5668
Sample 6 | 438.2224 | 503.0041
Sample 7 | 428.0591 | 494.4092
Sample 8 | 427.0613 | 643.4305
Sample 9 | 446.6219 | 523.7768
Sample 10 | 396.5592 | 581.3948
Average | 398.9619 | 523.8918
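The averages in the table can be reproduced with a short script, which is also a convenient way to summarize your own CvDiskPerf samples:

```python
# Write and read throughput samples (GB/hour) from the table above.
write = [341.3798, 344.3546, 340.8644, 428.8675, 397.6285,
         438.2224, 428.0591, 427.0613, 446.6219, 396.5592]
read = [477.6198, 513.2807, 575.6513, 499.7836, 426.5668,
        503.0041, 494.4092, 643.4305, 523.7768, 581.3948]

avg_write = round(sum(write) / len(write), 4)
avg_read = round(sum(read) / len(read), 4)
print(avg_write, avg_read)  # 398.9619 523.8918
```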
By default, a new deduplication store is created for every 100 TB of data. Note that this is the amount of data stored on the media after deduplication. Depending upon your configuration and requirement, deduplication store creation can be configured based on the following parameters:
See Configure Deduplication Store Creation for step-by-step instructions.
The currently active deduplication store can be sealed on demand. When a deduplication store is sealed, no new data is deduplicated against the existing data in the store; the current store is closed. A new store is automatically created, and deduplication for new backup jobs is recorded in the new store. This option is useful in rare cases of hardware issues, such as disk corruption, where creating a new store prevents new data from referencing any of the old data on the corrupted disks.
See Seal the Active Deduplication Store for instructions.
You can use the SIDB tool to administer the deduplication store and collect diagnostics information. The SIDB tool can be used for the following purposes:
N-Instancing is the capability to specify the number of copies (instances) of each deduplicated object/block to be created in the deduplication storage. N-Instancing increases the integrity of data protection operations in the deduplication mode. The backup operation initially stores the first instance of the object/block; when the same object/block is encountered in subsequent backups, additional instances are stored, up to the specified number. During restore operations, if one instance is unreadable, the data is read from another available instance. The instances are referenced in a round-robin fashion.
N-Instancing can be set in the Redundancy Factor parameter. See Configure Deduplication Options for Storage Policy Copies for step-by-step instructions to configure the number of instances stored.
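The round-robin read with fallback can be sketched as follows. This is an illustrative model only; the list of `(readable, data)` pairs is a stand-in for the N stored instances of a block:

```python
def read_block(instances, start=0):
    """Try instances round-robin starting at `start`; skip unreadable copies.

    `instances` is a list of (readable, data) pairs standing in for the N
    stored copies of a deduplicated block.
    """
    n = len(instances)
    for offset in range(n):
        readable, data = instances[(start + offset) % n]
        if readable:
            return data
    raise IOError("no readable instance of the block")

# Redundancy Factor of 3: the first copy is unreadable, so the restore
# falls through to the next available instance.
copies = [(False, None), (True, b"block"), (True, b"block")]
print(read_block(copies, start=0))  # b'block'
```

The restore fails only when every configured instance is unreadable, which is the integrity benefit N-Instancing provides.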
This feature requires a Feature License to be available in the CommServe® Server.
Review general license requirements included in License Administration. Also, View All Licenses provides step-by-step instructions on how to view the license information.
The following license is required for this feature:
Block Level De-Duplication license for using deduplication. One license is required for each MediaAgent hosting the Deduplication Store.
If you have upgraded the MediaAgent, the Data De-Duplication Enabler license from the previous release can be used for object level deduplication. Deduplication from the previous release continues as object level deduplication in this release.
The deduplication option can be enabled (or disabled) only during copy creation. Once a copy is created, deduplication cannot be enabled or disabled on it.
Review the following before enabling deduplication.
When a Storage Policy is cloned, deduplication is disabled in the cloned Storage Policy and cannot be enabled. It is therefore recommended that you do not clone storage policies if you wish to enable deduplication.
Use the following steps to implement deduplication on backup storage:
See Enable Deduplication for a Primary Copy for step-by-step instructions.
If you select block level deduplication, set the block size for deduplication in the storage policy properties. See Configure Block Size for Block Level Deduplication for step-by-step instructions. If necessary, deduplication can be enabled and configured using a subclient policy as a template and then associated with the appropriate subclients. See Subclient Policies for more information.
Note that while object level deduplication is supported for older data, block level deduplication is only supported for backups performed using the current version of the software.
The following deduplication options can be configured:
The following sections describe the various operations in the deduplication mode.
When deduplication is enabled, the sequence of operations is similar to that of a regular data protection job.
When a data protection job is initiated, the backup module secures the data and starts the data transfer to the MediaAgent. As the data is secured, it is first compressed (if data compression is enabled on the client), then hashed by the signature generation module (if deduplication signature generation is configured on the client), and finally encrypted (if client encryption is enabled).
Data Multiplexing is not supported with Deduplication.
Data recovery operations are identical to regular restore operations and are virtually unaffected by deduplication. The deduplication store is not contacted for normal restore operations, except when the data is not available on the disk and more than one instance is configured using N-Instancing.
All types of restore operations (including Restore by Jobs and Restoring from copies) are supported.
Auxiliary Copy operations automatically rehydrate (unravel) deduplicated data. If the secondary copy is set up for deduplication, a deduplication store is created for that copy and the associated data is deduplicated for the secondary copy. The auxiliary copy follows the deduplication type (block level or object level) of the source copy.
Data aging operations automatically look up the deduplication store before data is deleted from the disk. Data aging deletes the source data only when all references to a given object/block have been pruned. So if older chunks in magnetic libraries remain on the volume even after the original data is deleted, it may be because deduplication references to those chunks are still valid.
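The pruning rule above amounts to reference counting. The following toy model, with an illustrative dictionary standing in for the store, shows why a chunk can outlive its original backup job:

```python
# Toy reference-count model of data aging with deduplication: a chunk on
# disk is deleted only when the last reference to it has been pruned.
refs = {"chunk1": 2}   # two backup jobs reference the same chunk

def prune_reference(chunk):
    refs[chunk] -= 1
    if refs[chunk] == 0:        # last reference gone: chunk can be aged
        del refs[chunk]
        return "deleted from disk"
    return "retained (still referenced)"

print(prune_reference("chunk1"))  # retained (still referenced)
print(prune_reference("chunk1"))  # deleted from disk
```

Aging the first job leaves the chunk on disk because the second job still points at it; only when the last reference is pruned is the space reclaimed.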
When data encryption and/or data compression is enabled, the system automatically runs the signature module after data compression and before data encryption. If the configured setup contradicts this order, the system automatically performs compression, signature generation, and encryption, in that order, on the source client computer. Pass-phrase protected data encryption is not supported.
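The reason for this fixed ordering can be shown with a small sketch. The XOR "encryption" below is a placeholder for illustration only; the point is that the signature is computed on compressed, unencrypted data, so identical source data always yields an identical signature regardless of the encryption key:

```python
import hashlib
import zlib

def backup_pipeline(data: bytes, key: int = 0x5A):
    """Order enforced by the system: compress, then sign, then encrypt.

    The XOR cipher is a toy stand-in for real encryption.
    """
    compressed = zlib.compress(data)
    sig = hashlib.sha512(compressed).hexdigest()   # signature before encryption
    encrypted = bytes(b ^ key for b in compressed)
    return encrypted, sig

# The same payload processed on two clients with different keys still
# deduplicates, because signature generation precedes encryption.
_, sig_a = backup_pipeline(b"payload", key=0x5A)
_, sig_b = backup_pipeline(b"payload", key=0x3C)
print(sig_a == sig_b)  # True
```

Had the signature been computed after encryption, identical data encrypted with different keys would never deduplicate.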
Review the following best practices before using deduplication.
The following calculations can be used to approximately determine the amount of space required for the deduplication store partition:
For example, if a client has 100 files to be backed up with a schedule of daily full backups for 30 days, the size of the deduplication store should be (0.2 * 100 * 30) KB = 600 KB. Note that this is the maximum space required; the deduplication store could be much smaller if a good level of deduplication is possible. In the above example, if the 100 files do not change at all, the requirement drops to (0.1 * 100 * 30) KB = 300 KB.
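The estimate generalizes to a one-line formula. The per-entry constants below (0.2 KB worst case, 0.1 KB when data deduplicates fully) are taken from the example above:

```python
def store_size_kb(files: int, backups: int, per_entry_kb: float = 0.2) -> float:
    # per_entry_kb: ~0.2 KB per file per backup in the worst case,
    # ~0.1 KB when the data deduplicates fully (constants from the
    # example above).
    return per_entry_kb * files * backups

print(store_size_kb(100, 30))                    # 600.0 KB (worst case)
print(store_size_kb(100, 30, per_entry_kb=0.1))  # 300.0 KB (fully deduplicated)
```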
For object level deduplication, the minimum size of a data object for which deduplication is attempted can be defined in the Minimum Size of Deduplicable Object parameter. A typical backup stream consists of the actual data and the metadata (such as ACLs) associated with the object/block. It is recommended that the minimum object/block size for deduplication be greater than 50 KB; reducing it to a smaller size may not result in significant space savings. See Configure Deduplication Options for Storage Policy Copies for instructions on changing the minimum size of deduplicable objects.
The period of time during which deduplicated data is used as the primary reference for new secondary objects can be defined in the Do not Deduplicate against objects older than parameter. This ensures that very old objects/blocks are not used as the 'origin' data for newer deduplicated data protection jobs. See Configure Deduplication Options for Storage Policy Copies for step-by-step instructions on setting the age of the primary object/block.
If the amount of free space on the volume hosting the deduplication store falls below the specified threshold, the MediaAgent generates an event message and, if configured, the MediaAgents (Disk Space Low) alert. If free space falls below 10%, the job will not continue.
See Available Alerts and Alert Descriptions and Space Check Thresholds for the Software Installation and System Directories for detailed information on setting up the alert.
If the free space on the volume hosting the Deduplication Store falls below 10% of the total space in the volume, deduplication-enabled jobs will fail.
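A monitoring check for this condition can be sketched as follows. The 10% failure threshold comes from the note above; the 15% warning threshold and the message strings are illustrative assumptions, not product values:

```python
import shutil

def check_store_volume(path: str, warn_pct: float = 15.0,
                       fail_pct: float = 10.0) -> str:
    """Classify free space on the volume hosting the deduplication store.

    fail_pct (10%) is the documented failure threshold; warn_pct is an
    illustrative alert threshold chosen for this sketch.
    """
    usage = shutil.disk_usage(path)
    free_pct = usage.free / usage.total * 100
    if free_pct < fail_pct:
        return "fail: deduplication-enabled jobs will fail"
    if free_pct < warn_pct:
        return "warn: MediaAgents (Disk Space Low) alert"
    return "ok"

print(check_store_volume("/"))
```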
Magnetic libraries are supported on clustered file systems such as Global File Systems (GFS), Cluster File Systems (CFS), IBM General Parallel File System (GPFS), and Polyserve File System.
Deduplication configurations that require multiple MediaAgents must use clustered file system storage for magnetic libraries. Clustered file system storage allows configuring multiple MediaAgents to access and share the same data on disk mount paths. This enables all MediaAgents of a storage policy copy using the same deduplication store to have access to data written by other MediaAgents as well.
You might reboot a MediaAgent to install updates or perform maintenance. For MediaAgents hosting the deduplication database, ensure that all deduplication transactions in memory are completed before rebooting. Failure to do so might result in sealing of the deduplication store, which increases the amount of storage space consumed in the primary disk library. See Rebooting a MediaAgent Hosting the Deduplication Store for step-by-step instructions.
Operations performed with this feature are recorded in the Audit Trail. See Audit Trail for more information.
Storage Policy Report provides necessary information on jobs that are deduplicated.
Consider the following impact on deduplication when upgrading the MediaAgent:
When one of the MediaAgents associated with the deduplication solution is upgraded to the current version, all the other MediaAgents in that storage policy copy, including the MediaAgents associated with the data paths and the Deduplication Store, must also be upgraded.
Consider the following deduplication options:
MediaAgents with the Data De-duplication Enabler license from the previous release can be used for Object Level Deduplication.
If the CommServe is upgraded, but the MediaAgent and the clients are not upgraded, then deduplication will be available only at the Object Level.
It is recommended that the CommServe database not be restored to a different point in time when Storage Policies are configured with deduplication. If a CommServe database restore is necessary to recover the CommServe, then after the restore, the current deduplication store must be sealed and subsequent backup jobs must run against the new deduplication store. This prevents synchronization issues between the deduplication database and the CommServe database.
After CommCell migration, the deduplication store operates in the read-only mode on the destination CommCell. The migrated (deduplication enabled) storage policies on the destination CommCell can be used to restore the deduplicated data migrated from the source CommCell and perform auxiliary copy jobs with the migrated data as the source. However, the migrated storage policies on the destination CommCell cannot be used to perform new deduplication backup jobs.
Deduplication-enabled Storage Policy Copies cannot be configured as Spool Copies. Note that existing Deduplicated Spool Copies will continue to exist until the Spool Copy retention setting is removed. Once removed, the deduplicated copy cannot be configured as a Spool Copy.
When installing updates or patches in a deduplication-enabled setup, ensure all deduplication-enabled jobs are either suspended or stopped prior to installing the updates or patches. This will prevent accidental sealing of deduplication stores due to services being stopped when data protection operations are in progress.
When a mount path is no longer required, you can disable the mount path and reuse the disk space. To do this, first disable write operations on the mount path. This prevents the mount path from being used for subsequent write operations. The existing data in the mount path will be pruned based on the data aging settings. Once all the data in the mount path is completely pruned, disable the mount path and reuse the disk space. Note that mount paths used in a deduplication configuration cannot be deleted until the corresponding storage policy is deleted.
Once enabled, deduplication cannot be disabled from a storage policy copy. However, you can use the following workaround.
Although deduplication cannot be disabled, it can be temporarily suspended. When suspended, backup data is not deduplicated and the deduplication store is not accessed for signature verification. You can use this feature to temporarily detach the deduplication store and gain access to it, primarily for diagnostics and maintenance purposes. Once you resume deduplication, signature verification and data deduplication resume. See Suspend/Resume Deduplication for instructions.