Deduplication Building Block Guide

Table of Contents

Overview

Hardware Requirements for the Nodes

Supported Platforms

Deduplication Requirements

Single Node Deduplication

Two Node Deduplication

Three Node Deduplication

Deduplicating Different Data Types

Availability and Store Management

Designing for Remote Offices

Overview

A Building Block is a combination of server and storage that provides a modular approach to data management.

This building block guide illustrates how to choose the right number of deduplication nodes depending on your environment, storage requirements, and production data size. Choosing the right deduplication model allows you to protect a large amount of data with minimal infrastructure, faster backups, and better scalability.

Server

For a building block, choose servers with fast processors and sufficient memory to deliver good performance and scalability.

Storage

Before setting up a building block, make sure that you plan for sufficient storage space that balances cost, availability, and performance. This includes space for the disk library and for the deduplication database (DDB).

Hardware Requirements for the Nodes

The Building Block requires the following suggested hardware specifications:

Server Specification (Operating System)

  • x64 operating system
  • Minimum dual core CPU; quad core recommended
  • 32 GB RAM
  • Windows / Linux

Data Throughput Port

  • 1 exclusive 10 GigE port (OR) 4 x 1 GigE ports
  • NIC teaming may be needed on the host

Disk Library Configuration

  • Network Attached Storage (NAS):
    Exclusive 10 GigE port
    7.2K RPM SAS spindles
  • (OR) SAS / FC / iSCSI:
    SAS: 6 Gbps HBA
    FC: 8 Gbps HBA
    iSCSI: exclusive 10 Gbps NIC
    7.2K RPM SATA/SAS spindles
    Minimum RAID 5 RAID groups with 7+ spindles each
    2 - 8 TB LUNs, up to 50 LUNs
    Dedicated storage adapters

Deduplication Database (DDB) LUN

The deduplication database needs to be on a fast disk for optimal backup performance. Before setting up the deduplication database, the storage volumes must be validated for high performance. This can be done using the IOMeter tool, which measures IOPS (Input/Output Operations Per Second). The following table lists the recommended DDB volume sizes and IOPS ratings.

For more information on how to use IOMeter, see IOPs For Storage Volumes.

Node Type | RAM | Suggested DDB Volume Size | IOPS With Single Worker Thread | IOPS With 8 Worker Threads | Estimated Front End Size* | Estimated Backend Size** | Example Configurations
Extra Large | 32 - 64 GB | 800 GB - 1 TB | IOPS exhibited by SSD or flash-based storage devices | (No need to run IOMeter to measure IOPS) | Up to 60 TB | Up to 120 TB | Fusion-io ioDrive2 785 GB PCI Express 2.0 x4 MLC Solid State Drive (SSD), (OR) 4 x 600 GB SSDs (using NAND flash), 3 in a RAID 5 configuration and 1 hot spare
Large | 32 - 64 GB | 800 GB - 1 TB | 320+ | 1000+ | Up to 40 TB | 60 - 90 TB | 8 spindles, 15K RPM SAS/FC, 300+ GB each, RAID 10 configuration
Medium | 24 - 32 GB | 600 GB | 260+ | 800+ | 20 - 30 TB | 30 - 60 TB | 6 spindles, 15K RPM SAS/FC, 300+ GB each, RAID 10 configuration
Small | 16 - 24 GB | 400 GB | 220+ | 600+ | Up to 20 TB | Up to 30 TB | 4 spindles, 15K RPM SAS/FC, 300+ GB each, RAID 10 configuration
Extra Small | 16 GB | 200 GB | 200+ | 350+ | Up to 10 TB | Up to 15 TB | 2 spindles, 15K RPM SAS/FC, 300 GB each, RAID 1 configuration

Assuming 30 - 90 days of retention with a weekly full backup cycle.

* - Total size of Application Data backed up in the first full backup.

** - Total data footprint on disk library.
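As an illustration only, the following sketch encodes the sizing table above so that a node class can be picked from an estimated front-end data size and IOMeter results can be checked against the minimum IOPS ratings. The function and variable names are hypothetical and not part of the product.

    # Hypothetical sizing helper based on the table above (names are illustrative only).
    NODE_SIZES = [
        # (name, max front-end TB, min IOPS @ 1 thread, min IOPS @ 8 threads, DDB volume)
        ("Extra Small", 10, 200, 350, "200 GB"),
        ("Small", 20, 220, 600, "400 GB"),
        ("Medium", 30, 260, 800, "600 GB"),
        ("Large", 40, 320, 1000, "800 GB - 1 TB"),
        ("Extra Large", 60, None, None, "800 GB - 1 TB"),  # SSD/flash class; no IOMeter run needed
    ]

    def suggest_node(front_end_tb):
        """Return the smallest node class whose estimated front-end size covers the data."""
        for name, max_tb, _, _, ddb_volume in NODE_SIZES:
            if front_end_tb <= max_tb:
                return name, ddb_volume
        raise ValueError("More than 60 TB of front-end data: add another building block")

    def ddb_volume_is_fast_enough(node_name, iops_1_thread, iops_8_threads):
        """Compare IOMeter results against the table's minimum IOPS ratings."""
        for name, _, min_1, min_8, _ in NODE_SIZES:
            if name == node_name:
                if min_1 is None:                    # Extra Large assumes SSD or flash storage
                    return True
                return iops_1_thread >= min_1 and iops_8_threads >= min_8
        raise ValueError("Unknown node type: " + node_name)

    print(suggest_node(25))                                  # ('Medium', '600 GB')
    print(ddb_volume_is_fast_enough("Medium", 300, 900))     # True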

Examples

Examples of servers that meet the above requirements:

Servers | Blades
Dell R720 with H710 and H810 controllers and MD storage | Dell M610 blades in a Dell M1000e enclosure with a 10 GigE backplane and EqualLogic or MD3000i storage (OR) 8 Gbps FC fabric
HP DL 380 G6 with 480i internal controller and FC / 10 GigE iSCSI / 6 Gbps SAS for external storage | HP BL460 or BL600 blades in an HP c7000 enclosure with 8 Gbps FC fabric and 10 GigE Ethernet fabric
IBM x3550 or above with internal SAS controller and external SAS / FC / 10 GigE iSCSI controller | IBM JS, PS, or HS blade servers with FC / 10 GigE fabrics

Supported Platforms

The deduplication database can be located on any of the following platforms:

Windows

All platforms supported by Windows MediaAgents, except 64-bit editions on Intel Itanium (IA64) and Windows XP.

Supported on NTFS.

Linux

All platforms supported by Linux MediaAgents, except PowerPC (including IBM System p).

Supported on ext3 and ext4.

Microsoft Cluster Service (MSCS)

Clusters supported by Windows MediaAgents.

Supported on NTFS.

Linux Cluster

Clusters supported by Linux MediaAgents.

Supported on ext3 and ext4.

Deduplication Requirements

The following aspects need to be considered before configuring deduplication in your environment.

Storage Policy

Deduplication is centrally managed through storage policies. Each storage policy can maintain its own deduplication settings or can be associated with a global deduplication storage policy. Depending on the type of data and the production data size, you can use a dedicated storage policy or a global deduplication policy.

Deduplication Database

The deduplication database maintains all signature hash records for a deduplication storage policy. This database can scale to a maximum of 750 million records. This limit is equivalent to 90 TB of data residing on the disk library and 900 TB of production data (application data), assuming a 10:1 deduplication ratio. For better performance, it is suggested to limit the deduplication database to 50 concurrent connections or streams.
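The 90 TB and 900 TB figures follow from the record limit. Below is a rough back-of-the-envelope sketch, assuming the default 128 K block size and the stated 10:1 deduplication ratio (one signature record per unique block); the variable names are illustrative only.

    # Back-of-the-envelope sizing for one deduplication database (illustrative only).
    MAX_RECORDS = 750_000_000     # maximum signature records per deduplication database
    BLOCK_SIZE_KB = 128           # default deduplication block size
    DEDUPE_RATIO = 10             # assumed 10:1 deduplication ratio

    disk_library_tb = MAX_RECORDS * BLOCK_SIZE_KB / (1024 ** 3)   # unique data on the disk library, in TB
    application_tb = disk_library_tb * DEDUPE_RATIO               # protected application (front-end) data

    print(round(disk_library_tb), "TB on the disk library")   # ~89 TB, i.e. the ~90 TB limit
    print(round(application_tb), "TB of application data")    # ~894 TB, i.e. the ~900 TB limit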

It is always recommended to host a single deduplication database per node. However, certain workloads may require higher concurrency but lower capacity (for example, DLO solutions). In this case, a maximum of 2 deduplication databases is recommended on a given deduplication database MediaAgent. Note that these deduplication databases must be hosted on different disk drives.

Also, it is recommended to locate the deduplication database locally on the MediaAgent. The faster the disk performance, the more efficient the data protection and deduplication processes will be.

Disk Library

The Disk Library consists of disk devices that point to the location of the Disk Library folders. Each disk device may have a read/write path or a read-only path. The read/write path is for the MediaAgent controlling the mount path to perform backups. The read-only path allows an alternate MediaAgent to read data from the host MediaAgent, so that restores or auxiliary copy operations can run while the local MediaAgent is busy.

During deduplication backups, do not write non-deduplicated and deduplicated data to the same library; doing so skews the overall disk usage information and makes space usage prediction difficult.

Each Building Block can support 100 TB of disk storage. The disk storage should be partitioned into 2 - 4 TB LUNs and configured as mount points in the operating system. This LUN size is recommended to allow for ease of maintenance of the Disk Library. Additionally, a larger array of smaller LUNs reduces the impact of the failure of a given LUN.
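As a quick planning aid, the sketch below computes how many mount paths the recommended LUN sizes imply for a fully populated building block; the numbers come from this section and the figures are approximations only.

    # Rough mount-path planning for one fully populated building block (illustrative only).
    disk_storage_tb = 100                                   # maximum disk storage per building block
    for lun_size_tb in (2, 4):                              # recommended LUN size range
        mount_paths = -(-disk_storage_tb // lun_size_tb)    # ceiling division
        print(lun_size_tb, "TB LUNs ->", mount_paths, "mount paths")
    # 2 TB LUNs -> 50 mount paths; 4 TB LUNs -> 25 mount paths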

For disk storage, the mount paths can be divided into two types: dedicated mount paths, written to by a single MediaAgent, and shared mount paths, which are accessible from multiple MediaAgents (as in the two and three node configurations below).

Block Size

By default, the deduplication storage policy block size is set to 128 K. This default is recommended for all data types other than databases larger than 1 TB.

For large databases, the block size can be configured to 256 K (1 TB to 5 TB) or 512 K (> 5 TB) and should be configured without software compression enabled at the policy level. Any block from 4 K to the configured block size will automatically be hashed and checked into the deduplication database.
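The block-size guidance above can be summarized as a simple rule. The following sketch only illustrates that rule; the function name and data-type labels are hypothetical, and remember that the larger block sizes should be configured without software compression at the policy level.

    # Illustrative summary of the block-size guidance (hypothetical helper, not a product API).
    def recommended_block_size_kb(data_type, database_size_tb=0):
        """128 K by default, 256 K for 1 - 5 TB databases, 512 K for databases over 5 TB."""
        if data_type == "database" and database_size_tb > 5:
            return 512
        if data_type == "database" and database_size_tb >= 1:
            return 256
        return 128  # default for all other data types

    print(recommended_block_size_kb("filesystem"))                      # 128
    print(recommended_block_size_kb("database", database_size_tb=3))    # 256
    print(recommended_block_size_kb("database", database_size_tb=8))    # 512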

Block size can be configured from the Storage Policy Properties - Advanced tab. When configuring the global deduplication policy, all other storage policy copies that are associated with the global deduplication policy must use the same block size. To modify the block size of global deduplication policy, see Block Size for Global Deduplication for step-by-step instructions.

Compression

By default, when a deduplication storage policy is configured, compression is automatically enabled at the policy level. This setting overrides the subclient compression settings. For most data types, compression is recommended. This compresses the blocks and generates a signature hash on the compressed block.

For database agents, data compression is not recommended, as these agents compress the data at the application level before handing it off; compressing already compressed data causes it to expand.

When a global deduplication storage policy is configured, its compression setting overrides both the storage policy and subclient compression settings.

Data Streams

The stream configuration in a Storage Policy design is also important. When a Round-Robin design is configured, ensure that the total number of streams across the storage policies associated with the Global Deduplication Storage Policy does not exceed 50. This ensures that no more than 50 jobs protect data at any given time, which prevents overloading the deduplication database.

For example, a Global Deduplication Storage Policy may have four associated storage policies with 50 streams each, for a total of 200 streams. If all storage policies were in concurrent use, the deduplication database would have 200 connections and performance would degrade. By limiting the number of writers to a total of 50, all 200 jobs may start; however, only 50 will run at any one time. As resources become available from completing jobs, the waiting jobs resume.
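The behavior described above is essentially a shared stream limit acting as a queue. The sketch below illustrates that principle only, under the assumption of a 50-stream cap; it is not how the software is implemented.

    # Illustration of the 50-stream cap: 200 jobs are submitted, but a shared limit
    # means no more than 50 run at the same time (not the product's implementation).
    from concurrent.futures import ThreadPoolExecutor
    import threading
    import time

    DEVICE_STREAMS = 50
    stream_pool = threading.BoundedSemaphore(DEVICE_STREAMS)

    def backup_job(job_id):
        with stream_pool:        # the job waits here until one of the 50 streams is free
            time.sleep(0.01)     # stand-in for the actual data protection work
            return job_id

    with ThreadPoolExecutor(max_workers=200) as executor:   # all 200 jobs may start
        results = list(executor.map(backup_job, range(200)))

    print(len(results), "jobs completed, at most", DEVICE_STREAMS, "running at any one time")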

Datapaths

Consider the following when using SAN storage for Data Path configuration:

Consider the following when using NAS storage for Data Path configuration:

Single Node Deduplication

Single node deduplication is useful in small and ad-hoc environments where the production (application) data is approximately 40 TB or less. This can be configured in one of the following ways:

Using Deduplication Storage Policy

For small amounts of data, it is recommended to use a standard storage policy with deduplication enabled. This setup consists of a single MediaAgent that hosts both the deduplication database and the disk library.

This method is recommended when your environment has less than 40 TB of application data and requires a single retention criterion.

  1. Configure Disk Library.

    For step-by-step instructions, see Getting Started - Disk Library.

    The disk storage should be partitioned into 2 - 8 TB LUNs or a 20 TB NAS share, configured as mount paths.

  2. Create deduplication storage policy.

    For step-by-step instructions, see Getting Started - Deduplication.

    Ensure that the deduplication database is hosted on a high-performance LUN.

Using Global Deduplication Policy

For small environments that do not contain a large amount of data but require different retention settings, multiple storage policy copies can be associated with a global deduplication policy.

This method is recommended for small environments with the data path defined through a single MediaAgent.

  1. Configure Disk Library.

    See Getting Started - Disk Library, for step-by-step instructions.

    The disk storage should be partitioned into 2 - 8 TB LUNs or a 20 TB NAS share, configured as mount paths.

  2. Create Global Deduplication Policy and storage policy with global deduplication enabled.

    See Getting Started - Global Deduplication, for step-by-step instructions.

    Ensure that the deduplication database is hosted on a high-performance LUN.

  3. Ensure that the global deduplication policy and the associated storage policies are configured with the same deduplication block size.

    See Block Size for Global Deduplication for step-by-step instructions.

  4. By default, compression is enabled on the global deduplication policy. Depending on the application data type, make sure to turn the compression setting on or off.

    For step-by-step instructions, see Setting Up Data Compression.

  5. Additionally, you can create one or more storage policies with different retention requirements and associate them with the global deduplication policy.

Two Node Deduplication

Two node deduplication is useful for protecting production data of more than 40 TB and less than 80 TB. This model uses two MediaAgents, each hosting its own deduplication store and disk library: Store 1 with Disk Library 1 on MediaAgent 1, and Store 2 with Disk Library 2 on MediaAgent 2. Each node also serves as an alternate (failover) data path for the other.

Use the following steps to set up two deduplication nodes:
  1. Create a shared disk library.

    See Getting Started - Shared Disk Library for step-by-step instructions.

  2. Create Global deduplication policy.

    See Getting Started - Global Deduplication Policy for step-by-step instructions.

  3. Ensure that the global deduplication policy and the associated storage policies are configured with the same deduplication block size.

    See Block Size for Global Deduplication for step-by-step instructions.

  4. By default, compression is enabled on the global deduplication policy. Depending on the application data type, make sure to turn the compression setting on or off.

    For step-by-step instructions, see Setting Up Data Compression.

  5. Repeat steps 1 - 4 for Store 2.
  6. For Store 1, the preferred datapath is Disk Library 1 (MediaAgent 1).

    Add Disk Library 2 which is available on MediaAgent 2 as alternate datapath for failover.

    See Configuring Data Paths for step-by-step instructions.

  7. For Store 2, the preferred datapath is Disk Library 2 (MediaAgent 2).

    Add Disk Library 1 which is available on MediaAgent 1 as alternate datapath for failover.

    See Configuring Data Paths for step-by-step instructions.

Three Node Deduplication

Three node deduplication is useful for protecting production data of more than 80 TB and less than 120 TB. This model uses three MediaAgents, each hosting its own deduplication store and disk library: Store 1 with Disk Library 1 on MediaAgent 1, Store 2 with Disk Library 2 on MediaAgent 2, and Store 3 with Disk Library 3 on MediaAgent 3. Each node also serves as an alternate (failover) data path for the others.

Use the following steps to set up three deduplication nodes:
  1. Create a shared disk library.

    See Getting Started - Shared Disk Library for step-by-step instructions.

  2. Create Global deduplication policy.

    See Getting Started - Global Deduplication Policy for step-by-step instructions.

  3. Ensure that the global deduplication policy and the associated storage policies are configured with the same deduplication block size.

    See Block Size for Global Deduplication for step-by-step instructions.

  4. By default, compression is enabled on the global deduplication policy. Depending on the application data type, make sure to turn the compression setting on or off.

    For step-by-step instructions, see Setting Up Data Compression.

  5. Repeat steps 1 - 4 for Stores 2 and 3.
  6. For Store 1, the preferred datapath is Disk Library 1 (MediaAgent 1).

    Add Disk Library 2 (MediaAgent 2) & Disk Library 3 (MediaAgent 3) as alternate datapaths for failover.

    See Configuring Data Paths for step-by-step instructions.

  7. For Store 2, the preferred datapath is Disk Library 2 (MediaAgent 2).

    Add Disk Library 1 (MediaAgent 1) & Disk Library 3 (MediaAgent 3) as alternate datapaths for failover.

    See Configuring Data Paths for step-by-step instructions.

  8. For Store 3, the preferred datapath is Disk Library 3 (MediaAgent 3).

    Add Disk Library 1 (MediaAgent 1) & Disk Library 2 (MediaAgent 2) as alternate datapaths for failover.

    See Configuring Data Paths for step-by-step instructions.

Deduplicating Different Data Types

In a larger environment, certain data types do not deduplicate well against others, such as database data against file system data. In addition, data from applications such as SQL and Oracle is compressed at the application level, whereas file system data may not be.

So, for the best performance and scalability when backing up different data types (file system data, SQL data, and Exchange data), it is a best practice to use a separate global deduplication policy for each data type.

During this setup, set the recommended block size for each data type:

  • By default, the block size is set to 128 K; for large databases, it is recommended to set it to 256 K to allow for higher scalability.

In this design, each data type uses a separate global deduplication policy to protect and manage its deduplicated data.
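As a sketch only, the layout below notes one way such a design could be recorded: separate global deduplication policies per data type, with block size and compression chosen per the guidance in this guide. The policy names are hypothetical examples, not product defaults.

    # Hypothetical per-data-type policy layout (names are examples only).
    GLOBAL_DEDUP_POLICIES = {
        "GDSP_FileSystem": {"data_types": ["File System"], "block_size_kb": 128, "compression": True},
        "GDSP_Exchange":   {"data_types": ["Exchange"],     "block_size_kb": 128, "compression": True},
        # Large databases: larger block size, and no software compression because the
        # application agents already compress the data.
        "GDSP_Databases":  {"data_types": ["SQL", "Oracle"], "block_size_kb": 256, "compression": False},
    }

    for policy, cfg in GLOBAL_DEDUP_POLICIES.items():
        print(policy, "->", ", ".join(cfg["data_types"]),
              "| block size:", cfg["block_size_kb"], "K | compression:", cfg["compression"])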

Availability and Store Management

How Does Deduplication Database Backup Work?

Using the File System Agent is the preferred method for deduplication database backups. This method allows periodic backups of the deduplication database without requiring jobs to pause.

This method backs up all active (non-sealed) deduplication databases available on that MediaAgent.

To set up the deduplication database backup using file system, see Backing Up Deduplication Store Database.

How Does Deduplication Database Reconstruction Work?

In the event of a deduplication store failure, a reconstruction job is automatically initiated, which restores the deduplication database from its most recent backup and then replays the records logged since that backup.

How Often Do I Need To Schedule Deduplication Database Backups?

Schedule the deduplication database backup as frequently as desired for database protection. A schedule of every 6 to 8 hours is recommended.

How Should I Schedule Deduplication Database Backups with Respect to Data Aging Jobs to Minimize the Time for Reconstruction?

After the completion of the data aging job scheduled on the CommServe, the physical pruning on the disk library begins. The best time to run the deduplication database backup is when all the physical pruning of data blocks on the disk library is complete, so that a reconstruction job using the deduplication database snapshot from this backup has fewer prune records to replay. The physical pruning usually takes a few hours after the completion of the data aging job, so it is generally best to run the deduplication database backup at the midpoint between two data aging jobs.

For example, if data aging is scheduled to run every 6 hours, the recommended schedules are:

Data Aging: Every 6 hours starting at 3:00 AM.

Deduplication Database Backup: Every 6 hours starting at 6:00 AM.
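A minimal sketch of the midpoint rule, reproducing the example schedule above; the date and variable names are illustrative only.

    # Placing the DDB backup halfway between two data aging runs (illustrative only).
    from datetime import datetime, timedelta

    data_aging_start = datetime(2013, 4, 24, 3, 0)   # data aging every 6 hours, starting at 3:00 AM
    interval = timedelta(hours=6)

    ddb_backup_start = data_aging_start + interval / 2   # midpoint between two data aging jobs
    print("Data aging starts at:", data_aging_start.strftime("%I:%M %p"))   # 03:00 AM
    print("DDB backup starts at:", ddb_backup_start.strftime("%I:%M %p"))   # 06:00 AM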

How Should I Schedule Deduplication Database Backups with Respect to File System Backups Using the Same Storage Policy?

Schedule the deduplication database backup so that it runs when as few backups as possible are in progress. This ensures that the deduplication database backup finishes sooner.

How Can I Setup The Alerts on Deduplication Database Backups?

You can configure alerts for deduplication database backup jobs so that you are notified when a backup job fails or when no backup jobs have run.

See Configure Alerts for Deduplication Store Backup for step-by-step instructions.

Designing for Remote Offices

Consider a setup with multiple remote sites and a centralized data center. Each remote site backs up its internal data using individual storage policies and saves a copy of the backup at the centralized data center. Although the redundant data within each individual backup can be eliminated using deduplication on the primary copies at the remote sites, the secondary copies stored at the data center might still contain redundant data across the copies. This redundant data can be identified and eliminated using global deduplication.

For step-by-step instructions on how to set up remote office backups, see Remote Office Backup Using Global Deduplication.

 

Last Updated On: 24 April 2013 (Release 9.0.0 Service Pack 10)