Deduplication Building Block Guide

Table of Contents

Overview

Hardware Requirements for the Nodes

Supported Platforms

Deduplication Requirements

Single Node Deduplication

Two Node Deduplication

Three Node Deduplication

Deduplicating Different Data Types

Availability and Store Management

Designing for Remote Offices

Overview

A Building Block is a combination of server and storage that provides a modular approach to data management.

This building block guide illustrates how to choose the right number of deduplication nodes depending on your environment, storage requirements, and production data size. Choosing the right deduplication model allows you to protect a large amount of data with minimal infrastructure, faster backups, and better scalability.

Server

For a building block, choose servers with fast processors and sufficient memory to deliver good performance and scalability.

Storage

Before setting up a building block, make sure that you plan for sufficient storage space that balances cost, availability, and performance. This includes space for the disk library and for the deduplication database (DDB).

Hardware Requirements for the Nodes

The Building Block requires the following suggested hardware specifications:

Server Specification (Operating System)

  • x64 operating system
  • Minimum dual core CPU; quad core recommended
  • 32 GB RAM
  • Windows / Linux

Data Throughput Port

  • 1 exclusive 10 GigE port (OR) 4 x 1 GigE ports
  • NIC teaming may be needed on the host

Disk Library Configuration

  • Network Attached Storage (NAS):
    Exclusive 10 GigE port
    7.2K RPM SAS spindles
  • (OR) SAS / FC / iSCSI:
    SAS: 6 Gbps HBA
    FC: 8 Gbps HBA
    iSCSI: exclusive 10 Gbps NIC
    7.2K RPM SATA/SAS spindles
    Minimum RAID 5 RAID groups with 7+ spindles each
    2 - 8 TB LUNs, up to 50 LUNs
    Dedicated storage adapters

Deduplication Database (DDB) LUN

The deduplication database needs to be on a fast disk for optimal backup performance. Before setting up the deduplication database, the storage volumes must be validated for high performance. This can be done using the IOMeter tool, which measures IOPS (Input/Output Operations Per Second). The following table lists the recommended DDB volume sizes and IOPS ratings.

For more information on how to use IOMeter, see IOPs For Storage Volumes.

Node Type | RAM | Suggested DDB Volume Size | IOPS With Single Worker Thread | IOPS With 8 Worker Threads | Estimated Front End Size* | Estimated Backend Size** | Example Configurations
Extra Large | 32 - 64 GB | 800 GB - 1 TB | IOPS exhibited by SSD or flash-based storage devices | (No need to run IOMeter to measure IOPS) | Up to 60 TB | Up to 120 TB | Fusion-io ioDrive2 785 GB PCI Express 2.0 x4 MLC Solid State Drive (SSD), (OR) 4 x 600 GB SSDs (using NAND flash), 3 in a RAID 5 configuration and 1 hot spare
Large | 32 - 64 GB | 800 GB - 1 TB | 320+ | 1000+ | Up to 40 TB | 60 - 90 TB | 8 spindles, 15K RPM SAS/FC, 300+ GB each, RAID 10 configuration
Medium | 24 - 32 GB | 600 GB | 260+ | 800+ | 20 - 30 TB | 30 - 60 TB | 6 spindles, 15K RPM SAS/FC, 300+ GB each, RAID 10 configuration
Small | 16 - 24 GB | 400 GB | 220+ | 600+ | Up to 20 TB | Up to 30 TB | 4 spindles, 15K RPM SAS/FC, 300+ GB each, RAID 10 configuration
Extra Small | 16 GB | 200 GB | 200+ | 350+ | Up to 10 TB | Up to 15 TB | 2 spindles, 15K RPM SAS/FC, 300 GB each, RAID 1 configuration

Assuming 30 - 90 days of retention with a weekly full backup cycle.

* - Total size of Application Data backed up in the first full backup.

** - Total data footprint on disk library.
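As an illustration only, the following sketch encodes the sizing table above so that a node class can be picked from an estimated front-end data size and IOMeter results can be checked against the minimum IOPS ratings. The function and variable names are hypothetical and not part of the product.

    # Hypothetical sizing helper based on the table above (names are illustrative only).
    NODE_SIZES = [
        # (name, max front-end TB, min IOPS @ 1 thread, min IOPS @ 8 threads, DDB volume)
        ("Extra Small", 10, 200, 350, "200 GB"),
        ("Small", 20, 220, 600, "400 GB"),
        ("Medium", 30, 260, 800, "600 GB"),
        ("Large", 40, 320, 1000, "800 GB - 1 TB"),
        ("Extra Large", 60, None, None, "800 GB - 1 TB"),  # SSD/flash class; no IOMeter run needed
    ]

    def suggest_node(front_end_tb):
        """Return the smallest node class whose estimated front-end size covers the data."""
        for name, max_tb, _, _, ddb_volume in NODE_SIZES:
            if front_end_tb <= max_tb:
                return name, ddb_volume
        raise ValueError("More than 60 TB of front-end data: add another building block")

    def ddb_volume_is_fast_enough(node_name, iops_1_thread, iops_8_threads):
        """Compare IOMeter results against the table's minimum IOPS ratings."""
        for name, _, min_1, min_8, _ in NODE_SIZES:
            if name == node_name:
                if min_1 is None:                    # Extra Large assumes SSD or flash storage
                    return True
                return iops_1_thread >= min_1 and iops_8_threads >= min_8
        raise ValueError("Unknown node type: " + node_name)

    print(suggest_node(25))                                  # ('Medium', '600 GB')
    print(ddb_volume_is_fast_enough("Medium", 300, 900))     # True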

Examples

Examples of servers that meet the above requirements:

Servers | Blades
Dell R720 with H710 and H810 controllers and MD storage | Dell M610 blades in a Dell M1000e enclosure with a 10 GigE backplane and EqualLogic or MD3000i storage (OR) 8 Gbps FC fabric
HP DL 380 G6 with 480i internal controller and FC / 10 GigE iSCSI / 6 Gbps SAS for external storage | HP BL460 or BL600 blades in an HP c7000 enclosure with 8 Gbps FC fabric and 10 GigE Ethernet fabric
IBM x3550 or above with internal SAS controller and external SAS / FC / 10 GigE iSCSI controller | IBM JS, PS, or HS blade servers with FC / 10 GigE fabrics

Supported Platforms

The deduplication database can be located on any of the following platforms:

Windows

All platforms supported by Windows MediaAgents, except 64-bit editions on Intel Itanium (IA64) and Windows XP.

Supported on NTFS.

Linux

All platforms supported by Linux MediaAgents, except PowerPC (including IBM System p).

Supported on ext3 and ext4.

Microsoft Cluster Service (MSCS)

Clusters supported by Windows MediaAgents.

Supported on NTFS.

Linux Cluster

Clusters supported by Linux MediaAgents.

Supported on ext3 and ext4.

Deduplication Requirements

The following aspects need to be considered before configuring deduplication in your environment.

Storage Policy

Deduplication is centrally managed through storage policies. Each storage policy can maintain its own deduplication settings or can be associated with a global deduplication storage policy. Depending on the type of data and the production data size, you can use a dedicated storage policy or a global deduplication policy.

Deduplication Database

The deduplication database maintains all signature hash records for a deduplication storage policy. This database can scale to a maximum of 750 million records. This limit is equivalent to 90 TB of data residing on the disk library and 900 TB of production data (application data), assuming a 10:1 deduplication ratio. For better performance, it is suggested to limit the deduplication database to 50 concurrent connections or streams.
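The 90 TB and 900 TB figures follow from the record limit. Below is a rough back-of-the-envelope sketch, assuming the default 128 K block size and the stated 10:1 deduplication ratio (one signature record per unique block); the variable names are illustrative only.

    # Back-of-the-envelope sizing for one deduplication database (illustrative only).
    MAX_RECORDS = 750_000_000     # maximum signature records per deduplication database
    BLOCK_SIZE_KB = 128           # default deduplication block size
    DEDUPE_RATIO = 10             # assumed 10:1 deduplication ratio

    disk_library_tb = MAX_RECORDS * BLOCK_SIZE_KB / (1024 ** 3)   # unique data on the disk library, in TB
    application_tb = disk_library_tb * DEDUPE_RATIO               # protected application (front-end) data

    print(round(disk_library_tb), "TB on the disk library")   # ~89 TB, i.e. the ~90 TB limit
    print(round(application_tb), "TB of application data")    # ~894 TB, i.e. the ~900 TB limit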

It is always recommended to host a single deduplication database per node. However, certain workloads may require higher concurrency but lower capacity (for example, DLO solutions). In this case, a maximum of 2 deduplication databases is recommended on a given deduplication database MediaAgent. Note that these deduplication databases must be hosted on different disk drives.

Also, it is recommended to locate the deduplication database locally on the MediaAgent. The faster the disk performance, the more efficient the data protection and deduplication processes will be.

Disk Library

The Disk Library consists of disk devices that point to the location of the Disk Library folders. Each disk device may have a read/write path or a read-only path. The read/write path is for the MediaAgent controlling the mount path to perform backups. The read-only path allows an alternate MediaAgent to read data from the host MediaAgent, so that restores or auxiliary copy operations can run while the local MediaAgent is busy.

During deduplication backups, do not write non-deduplicated and deduplicated data to the same library; doing so skews the overall disk usage information and makes space usage prediction difficult.

Each Building Block can support 100 TB of disk storage. The disk storage should be partitioned into 2 - 4 TB LUNs and configured as mount points in the operating system. This LUN size is recommended to allow for ease of maintenance of the Disk Library. Additionally, a larger array of smaller LUNs reduces the impact of the failure of a given LUN.
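As a quick planning aid, the sketch below computes how many mount paths the recommended LUN sizes imply for a fully populated building block; the numbers come from this section and the figures are approximations only.

    # Rough mount-path planning for one fully populated building block (illustrative only).
    disk_storage_tb = 100                                   # maximum disk storage per building block
    for lun_size_tb in (2, 4):                              # recommended LUN size range
        mount_paths = -(-disk_storage_tb // lun_size_tb)    # ceiling division
        print(lun_size_tb, "TB LUNs ->", mount_paths, "mount paths")
    # 2 TB LUNs -> 50 mount paths; 4 TB LUNs -> 25 mount paths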

For disk storage, the mount paths can be divided into two types: dedicated mount paths, written to by a single MediaAgent, and shared mount paths, which are accessible from multiple MediaAgents (as in the two and three node configurations below).

Block Size

By default, the deduplication storage policy block size is set to 128 K. This default is recommended for all data types other than databases larger than 1 TB.

For large databases, the block size can be configured to 256 K (1 TB to 5 TB) or 512 K (> 5 TB) and should be configured without software compression enabled at the policy level. Any block from 4 K to the configured block size will automatically be hashed and checked into the deduplication database.
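The block-size guidance above can be summarized as a simple rule. The following sketch only illustrates that rule; the function name and data-type labels are hypothetical, and remember that the larger block sizes should be configured without software compression at the policy level.

    # Illustrative summary of the block-size guidance (hypothetical helper, not a product API).
    def recommended_block_size_kb(data_type, database_size_tb=0):
        """128 K by default, 256 K for 1 - 5 TB databases, 512 K for databases over 5 TB."""
        if data_type == "database" and database_size_tb > 5:
            return 512
        if data_type == "database" and database_size_tb >= 1:
            return 256
        return 128  # default for all other data types

    print(recommended_block_size_kb("filesystem"))                      # 128
    print(recommended_block_size_kb("database", database_size_tb=3))    # 256
    print(recommended_block_size_kb("database", database_size_tb=8))    # 512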

Block size can be configured from the Storage Policy Properties - Advanced tab. When configuring the global deduplication policy, all other storage policy copies that are associated with the global deduplication policy must use the same block size. To modify the block size of global deduplication policy, see Block Size for Global Deduplication for step-by-step instructions.

Compression

By default, when a deduplication storage policy is configured, compression is automatically enabled at the policy level. This setting overrides the subclient compression settings. For most data types, compression is recommended. This compresses the blocks and generates a signature hash on the compressed block.

For database agents, data compression is not recommended, as these agents compress the data at the application level before handing it off; compressing already compressed data causes it to expand.

When a global deduplication storage policy is configured, its compression setting overrides both the storage policy and subclient compression settings.

Data Streams

The stream configuration in a Storage Policy design is also important. When a Round-Robin design is configured, ensure that the total number of streams across the storage policies associated with the Global Deduplication Storage Policy does not exceed 50. This ensures that no more than 50 jobs protect data at any given time, which prevents overloading the deduplication database.

For example, a Global Deduplication Storage Policy may have four associated storage policies with 50 streams each, for a total of 200 streams. If all storage policies were in concurrent use, the deduplication database would have 200 connections and performance would degrade. By limiting the number of writers to a total of 50, all 200 jobs may start; however, only 50 will run at any one time. As resources become available from completing jobs, the waiting jobs resume.
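The behavior described above is essentially a shared stream limit acting as a queue. The sketch below illustrates that principle only, under the assumption of a 50-stream cap; it is not how the software is implemented.

    # Illustration of the 50-stream cap: 200 jobs are submitted, but a shared limit
    # means no more than 50 run at the same time (not the product's implementation).
    from concurrent.futures import ThreadPoolExecutor
    import threading
    import time

    DEVICE_STREAMS = 50
    stream_pool = threading.BoundedSemaphore(DEVICE_STREAMS)

    def backup_job(job_id):
        with stream_pool:        # the job waits here until one of the 50 streams is free
            time.sleep(0.01)     # stand-in for the actual data protection work
            return job_id

    with ThreadPoolExecutor(max_workers=200) as executor:   # all 200 jobs may start
        results = list(executor.map(backup_job, range(200)))

    print(len(results), "jobs completed, at most", DEVICE_STREAMS, "running at any one time")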

Datapaths

Consider the following when using SAN storage for Data Path configuration:

Consider the following when using NAS storage for Data Path configuration:

Single Node Deduplication

Single node deduplication is useful in small and ad-hoc environments where the production (application) data is approximately 40 TB or less. This can be configured in one of the following ways:

Using Deduplication Storage Policy

For small amounts of data, it is recommended to use a standard storage policy with deduplication enabled. This setup consists of a single MediaAgent that hosts both the deduplication database and the disk library.

This method is recommended when your environment has less than 40 TB of application data and requires a single retention criterion.

  1. Configure Disk Library.

    For step-by-step instructions, see Getting Started - Disk Library.

    The disk storage should be partitioned into 2 - 8 TB LUNs or a 20 TB NAS share, configured as mount paths.

  2. Create deduplication storage policy.

    For step-by-step instructions, see Getting Started - Deduplication.

    Ensure that the deduplication database is hosted on a high-performance LUN.

Using Global Deduplication Policy

For small environments that do not contain a large amount of data but require different retention settings, multiple storage policy copies can be associated with a global deduplication policy.

This method is recommended for small environments with the data path defined through a single MediaAgent.

  1. Configure Disk Library.

    See Getting Started - Disk Library, for step-by-step instructions.

    The disk storage should be partitioned into 2 - 8 TB LUNs or a 20 TB NAS share, configured as mount paths.

  2. Create Global Deduplication Policy and storage policy with global deduplication enabled.

    See Getting Started - Global Deduplication, for step-by-step instructions.

    Ensure that the deduplication database is hosted on a high-performance LUN.

  3. Ensure that the global deduplication policy and the associated storage policies are configured with the same deduplication block size.

    See Block Size for Global Deduplication for step-by-step instructions.

  4. By default, compression is enabled on the global deduplication policy. Depending on the application data type, make sure to turn the compression setting on or off.

    For step-by-step instructions, see Setting Up Data Compression.

  5. Additionally, you can create one or more storage policies with different retention requirements and associate them with the global deduplication policy.

Two Node Deduplication

Two node deduplication is useful for protecting production data of more than 40 TB and less than 80 TB. This model uses two MediaAgents, each hosting its own deduplication store and disk library: Store 1 with Disk Library 1 on MediaAgent 1, and Store 2 with Disk Library 2 on MediaAgent 2. Each node also serves as an alternate (failover) data path for the other.

Use the following steps to set up two deduplication nodes:
  1. Create a shared disk library.

    See Getting Started - Shared Disk Library for step-by-step instructions.

  2. Create Global deduplication policy.

    See Getting Started - Global Deduplication Policy for step-by-step instructions.

  3. Ensure that the global deduplication policy and the associated storage policies are configured with the same deduplication block size.

    See Block Size for Global Deduplication for step-by-step instructions.

  4. By default, compression is enabled on the global deduplication policy. Depending on the application data type, make sure to turn the compression setting on or off.

    For step-by-step instructions, see Setting Up Data Compression.

  5. Repeat steps 1 - 4 for Store 2.
  6. For Store 1, the preferred datapath is Disk Library 1 (MediaAgent 1).

    Add Disk Library 2 which is available on MediaAgent 2 as alternate datapath for failover.

    See Configuring Data Paths for step-by-step instructions.

  7. For Store 2, the preferred datapath is Disk Library 2 (MediaAgent 2).

    Add Disk Library 1 which is available on MediaAgent 1 as alternate datapath for failover.

    See Configuring Data Paths for step-by-step instructions.

Three Node Deduplication

Three node deduplication is useful for protecting production data of more than 80 TB and less than 120 TB. This model uses three MediaAgents, each hosting its own deduplication store and disk library: Store 1 with Disk Library 1 on MediaAgent 1, Store 2 with Disk Library 2 on MediaAgent 2, and Store 3 with Disk Library 3 on MediaAgent 3. Each node also serves as an alternate (failover) data path for the others.

Use the following steps to set up three deduplication nodes:
  1. Create a shared disk library.

    See Getting Started - Shared Disk Library for step-by-step instructions.

  2. Create Global deduplication policy.

    See Getting Started - Global Deduplication Policy for step-by-step instructions.

  3. Ensure that the global deduplication policy and the associated storage policies are configured with the same deduplication block size.

    See Block Size for Global Deduplication for step-by-step instructions.

  4. By default, compression is enabled on the global deduplication policy. Depending on the application data type, make sure to turn the compression setting on or off.

    For step-by-step instructions, see Setting Up Data Compression.

  5. Repeat steps 1 - 4 for Stores 2 and 3.
  6. For Store 1, the preferred datapath is Disk Library 1 (MediaAgent 1).

    Add Disk Library 2 (MediaAgent 2) & Disk Library 3 (MediaAgent 3) as alternate datapaths for failover.

    See Configuring Data Paths for step-by-step instructions.

  7. For Store 2, the preferred datapath is Disk Library 2 (MediaAgent 2).

    Add Disk Library 1 (MediaAgent 1) & Disk Library 3 (MediaAgent 3) as alternate datapaths for failover.

    See Configuring Data Paths for step-by-step instructions.

  8. For Store 3, the preferred datapath is Disk Library 3 (MediaAgent 3).

    Add Disk Library 1 (MediaAgent 1) & Disk Library 2 (MediaAgent 2) as alternate datapaths for failover.

    See Configuring Data Paths for step-by-step instructions.

Deduplicating Different Data Types

In a larger environment, certain data types do not deduplicate well against others, such as database data against file system data. In addition, data from applications such as SQL and Oracle is compressed at the application level, whereas file system data may not be.

So, for the best performance and scalability when backing up different data types (file system data, SQL data, and Exchange data), it is a best practice to use a separate global deduplication policy for each data type.

During this setup, set the recommended block size for each data type:

  • By default, the block size is set to 128 K; for large databases, it is recommended to set it to 256 K to allow for higher scalability.

In this design, each data type uses a separate global deduplication policy to protect and manage its deduplicated data.
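As a sketch only, the layout below notes one way such a design could be recorded: separate global deduplication policies per data type, with block size and compression chosen per the guidance in this guide. The policy names are hypothetical examples, not product defaults.

    # Hypothetical per-data-type policy layout (names are examples only).
    GLOBAL_DEDUP_POLICIES = {
        "GDSP_FileSystem": {"data_types": ["File System"], "block_size_kb": 128, "compression": True},
        "GDSP_Exchange":   {"data_types": ["Exchange"],     "block_size_kb": 128, "compression": True},
        # Large databases: larger block size, and no software compression because the
        # application agents already compress the data.
        "GDSP_Databases":  {"data_types": ["SQL", "Oracle"], "block_size_kb": 256, "compression": False},
    }

    for policy, cfg in GLOBAL_DEDUP_POLICIES.items():
        print(policy, "->", ", ".join(cfg["data_types"]),
              "| block size:", cfg["block_size_kb"], "K | compression:", cfg["compression"])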

Availability and Store Management

How Does Deduplication Database Backup Work?

Using the File System Agent is the preferred method for deduplication database backups. This method allows periodic backups of the deduplication database without requiring jobs to pause.

This method backs up all active (non-sealed) deduplication databases available on that MediaAgent.

To set up the deduplication database backup using file system, see Backing Up Deduplication Store Database.

How Does Deduplication Database Reconstruction Work?

In the event of a deduplication store failure, a reconstruction job is automatically initiated, which restores the deduplication database from its most recent backup and then replays the records logged since that backup.

How Often Do I Need To Schedule Deduplication Database Backups?

Schedule the deduplication database backup as frequently as desired for database protection. A schedule of every 6 to 8 hours is recommended.

How Should I Schedule Deduplication Database Backups with Respect to Data Aging Jobs to Minimize the Time for Reconstruction?

After the completion of the data aging job scheduled on the CommServe, the physical pruning on the disk library begins. The best time to run the deduplication database backup is when all the physical pruning of data blocks on the disk library is complete, so that a reconstruction job using the deduplication database snapshot from this backup has fewer prune records to replay. The physical pruning usually takes a few hours after the completion of the data aging job, so it is generally best to run the deduplication database backup at the midpoint between two data aging jobs.

For example, if data aging is scheduled to run every 6 hours, the recommended schedules are:

Data Aging: Every 6 hours starting at 3:00 AM.

Deduplication Database Backup: Every 6 hours starting at 6:00 AM.
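A minimal sketch of the midpoint rule, reproducing the example schedule above; the date and variable names are illustrative only.

    # Placing the DDB backup halfway between two data aging runs (illustrative only).
    from datetime import datetime, timedelta

    data_aging_start = datetime(2013, 4, 24, 3, 0)   # data aging every 6 hours, starting at 3:00 AM
    interval = timedelta(hours=6)

    ddb_backup_start = data_aging_start + interval / 2   # midpoint between two data aging jobs
    print("Data aging starts at:", data_aging_start.strftime("%I:%M %p"))   # 03:00 AM
    print("DDB backup starts at:", ddb_backup_start.strftime("%I:%M %p"))   # 06:00 AM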

How Should I Schedule Deduplication Database Backups with Respect to File System Backups Using the Same Storage Policy?

Schedule the deduplication database backup so that it runs when as few backups as possible are in progress. This ensures that the deduplication database backup finishes sooner.

How Can I Setup The Alerts on Deduplication Database Backups?

You can configure alerts for deduplication database backup jobs so that you are notified when a backup job fails or when no backup jobs have run.

See Configure Alerts for Deduplication Store Backup for step-by-step instructions.

Designing for Remote Offices

Consider a setup with multiple remote sites and a centralized data center. Each remote site backs up its internal data using individual storage policies and saves a copy of the backup at the centralized data center. Although the redundant data within each individual backup can be eliminated using deduplication on the primary copies at the remote sites, the secondary copies stored at the data center might still contain redundant data across the copies. This redundant data can be identified and eliminated using global deduplication.

For step-by-step instructions on how to set up remote office backups, see Remote Office Backup Using Global Deduplication.

 

Last Updated On: 24 April 2013 (Release 9.0.0 Service Pack 10)