User's guide

Deduplication at target

After a backup to a deduplicating vault is completed, the storage node runs the indexing task to deduplicate the data in the vault as follows:

1. It moves the items (disk blocks or files) from the archives to a special file within the vault, storing

duplicate items there only once. This file is called the deduplication data store. If there are both

disk-level and file-level backups in the vault, there are two separate data stores for them. Items

that cannot be deduplicated remain in the archives.

2. In the archives, it replaces the moved items with the corresponding references to them.

As a result, the vault contains a number of unique, deduplicated items, with each item having one or

more references to it from the vault's archives.
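The two indexing steps above can be sketched roughly as follows. This is a minimal illustration, not the storage node's actual implementation: the guide does not say here how duplicate items are recognized, so identifying them by a SHA-256 digest is an assumption, and the names `index_archives`, `data_store`, and `indexed` are hypothetical. For simplicity the sketch treats every item as deduplicable, whereas in the product some items remain in the archives.

```python
import hashlib


def index_archives(archives):
    """Move items out of archives into a shared store, leaving references behind.

    `archives` maps an archive name to a list of items (bytes objects).
    Returns (data_store, indexed), where `indexed` holds references only.
    """
    data_store = {}  # digest -> item; each unique item is stored once
    indexed = {}
    for name, items in archives.items():
        refs = []
        for item in items:
            # Assumption: items with the same SHA-256 digest are duplicates.
            digest = hashlib.sha256(item).hexdigest()
            data_store.setdefault(digest, item)  # step 1: keep the first copy only
            refs.append(digest)                  # step 2: the archive keeps a reference
        indexed[name] = refs
    return data_store, indexed


# Two machines deployed from the same image share most of their items.
archives = {
    "machine_a": [b"os-block", b"app-block", b"user-data-a"],
    "machine_b": [b"os-block", b"app-block", b"user-data-b"],
}
store, indexed = index_archives(archives)
print(len(store))  # 4 unique items instead of 6
```

Each archive can still be restored by following its references back into the store, which is why an item must stay in the store for as long as any archive refers to it.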

The indexing task may take considerable time to complete. You can see this task's state in the Tasks

view on the management server.

Compacting

After one or more backups or archives have been deleted from the vault, either manually or during cleanup, the vault may contain items that are no longer referenced by any archive. Such items are deleted by the compacting task, which is a scheduled task performed by the storage node.

By default, the compacting task runs every Sunday night at 03:00. You can re-schedule the task as

described in Actions on storage nodes (p. 315), under "Change the compacting task schedule". You

can also manually start or stop the task from the Tasks view.

Because deletion of unused items is resource-consuming, the compacting task performs it only when

a sufficient amount of data to delete has accumulated. The threshold is determined by the

Compacting Trigger Threshold (p. 331) configuration parameter.
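The idea behind compacting can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and expressing the threshold as a fraction of the data store's bytes is an assumption made for the example, not the documented meaning of the Compacting Trigger Threshold parameter.

```python
def unreferenced(data_store, archives):
    """Return the keys of stored items that no archive refers to."""
    referenced = {key for refs in archives.values() for key in refs}
    return [key for key in data_store if key not in referenced]


def compact(data_store, archives, threshold=0.1):
    """Delete unreferenced items, but only if enough data has accumulated.

    `threshold` is the minimum share of store bytes that must be
    unreferenced before deletion is worth its cost (an assumed definition).
    Returns the number of items deleted.
    """
    orphans = unreferenced(data_store, archives)
    total_bytes = sum(len(item) for item in data_store.values())
    orphan_bytes = sum(len(data_store[key]) for key in orphans)
    if total_bytes == 0 or orphan_bytes / total_bytes < threshold:
        return 0  # not enough to delete yet; skip the expensive work
    for key in orphans:
        del data_store[key]
    return len(orphans)


store = {"a": b"xxxx", "b": b"yyyy", "c": b"zz"}
archives = {"arch1": ["a", "b"], "arch2": ["b", "c"]}

print(compact(store, archives))                 # 0: every item is still referenced
del archives["arch2"]                           # an archive is deleted from the vault
print(compact(store, archives, threshold=0.5))  # 0: 2 of 10 bytes is below the threshold
print(compact(store, archives, threshold=0.1))  # 1: item "c" is removed
```

Item "b" survives the deletion of `arch2` because `arch1` still refers to it; only "c" becomes unreferenced.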

2.12.6.3 When deduplication is most effective

The following are cases when deduplication produces the maximum effect:

• When backing up similar data from different sources in full backup mode. Such is the case when you back up operating systems and applications deployed from a single source over the network.

• When performing incremental backups of similar data from different sources, provided that the changes to the data are also similar. Such is the case when you deploy updates to these systems and then run an incremental backup. Again, it is recommended that you first back up one machine and then the others, all at once or one by one.

• When performing incremental backups of data that does not change itself, but changes its location. Such is the case when multiple pieces of data circulate over the network or within one system. Each time a piece of data moves, it is included in the incremental backup, which becomes sizeable even though it contains no new data. Deduplication helps to solve the problem: each time an item appears in a new place, a reference to the item is saved instead of the item itself.
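The last case, data that moves without changing, can be illustrated with a short sketch. It assumes a file-level incremental backup that picks up any file whose path is new or whose content differs, and a data store keyed by SHA-256 digest; both are simplifications for illustration, and the function and variable names are hypothetical.

```python
import hashlib


def incremental_backup(current_files, previous_files, data_store):
    """Back up files that are new or changed since the previous backup.

    Returns (refs, new_bytes): the references saved for this backup and
    the number of bytes that actually had to be added to the data store.
    """
    refs = {}
    new_bytes = 0
    for path, content in current_files.items():
        if previous_files.get(path) == content:
            continue  # unchanged in place: not part of the incremental backup
        digest = hashlib.sha256(content).hexdigest()
        if digest not in data_store:
            data_store[digest] = content  # genuinely new data is stored
            new_bytes += len(content)
        refs[path] = digest               # otherwise only a reference is saved
    return refs, new_bytes


report = b"quarterly report" * 1000
previous = {"/inbox/report.doc": report}
store = {hashlib.sha256(report).hexdigest(): report}

# The file has moved to a new folder; its content is the same.
current = {"/archive/report.doc": report}
refs, new_bytes = incremental_backup(current, previous, store)
print(len(refs), new_bytes)  # 1 0
```

The moved file is included in the incremental backup, but because its content is already in the store, the backup adds a reference and no new data.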

Deduplication and incremental backups

If the changes to the data are random, deduplication during incremental backup will not produce much effect because:

• The deduplicated items that have not changed are not included in the incremental backup.