ZFS Deduplication Frequently Asked Questions (FAQ)

  1. What is ZFS deduplication?
  2. When are the dedup features available?
  3. What should I consider before enabling dedup?
  4. What are the performance impacts of enabling dedup?
  5. How do I enable dedup on my ZFS file systems?
  6. How does ZFS dedup space accounting work?

What is ZFS deduplication?

The ZFS deduplication feature removes redundant data from your ZFS file systems. If a file system has the dedup property enabled, duplicate data blocks are removed synchronously. The result is that only unique data is stored and common components are shared between files. For a detailed description of dedup, see Jeff's blog entry. 

  • Deduplication occurs at the block level and file block sizes might vary.
  • Only file data is deduplicated. File metadata is not deduplicated.
  • Only synchronous deduplication is available.

When are the dedup features available?

Solaris Express Community Edition (SXCE) build 129, which includes the dedup features and fixes, became available in December 2009.

Known dedup CRs and issues:

  • CR 6958873 - Lack of accounting for DDT in dedup'd frees can oversubscribe txg
  • Writing the same data using different compression algorithms will result in data that cannot be deduplicated.

The SXCE build 129 release provides the following deduplication features:

  • On-disk deduplication - duplicate ZFS data is removed when written to disk if dedup is enabled on a ZFS file system
  • ZFS send deduplication - duplicate ZFS data is removed from the stream sent over the wire when you use the zfs send -D option,
    whether or not dedup is enabled on the ZFS file system

The above dedup features are available in ZFS pool version 22.

What should I consider before enabling dedup?

If you enable dedup on file systems with duplicate data, you should see the benefits of saving space and better performance because less data is written and stored. If you enable dedup on file systems with little duplicate data, you will add system overhead with little benefit gained.

Note: The zdb debugging command can be used to determine the in-core dedup table requirements, but it must be
run on pools that are not in use.

Before you enable dedup, review the following recommendations:

  1. Make sure you review the list of known issues that are provided above.
  2. You can use the zdb command to simulate the potential space savings of enabling dedup on your pool data.
     The following command must be run on a quiet pool.

     # zdb -S pool-name

     If the estimated dedup ratio is greater than 2, then you might see dedup space savings.

  3. Make sure your system has enough memory to support dedup. Determine the memory requirements for deduplicating your data as follows:

A. Use the zdb -S output to determine the in-core dedup table requirements:

  • Each in-core dedup table entry is approximately 320 bytes
  • Multiply the number of allocated blocks by 320 bytes. For example:
     in-core DDT size = 3.75M blocks x 320 bytes = 1200 MB

B. Additional memory considerations from Roch's excellent blog:

20 TB of unique data stored in 128K records or more than 1TB of unique data in 8K records would require about 32 GB of physical memory. If you need to store more unique data than what these ratios provide, strongly consider allocating some large read optimized SSD to hold the deduplication table (DDT). The DDT lookups are small random I/Os that are well handled by current generation SSDs.
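
The estimate in step A above can be sketched as a small shell function. This is a rough sketch rather than an official sizing tool: ddt_incore_bytes is a hypothetical helper name, the 320-byte entry size is the approximation quoted above, and the block count is a value you would read from your own zdb -S output.

```shell
#!/bin/sh
# Rough in-core DDT size estimate, assuming about 320 bytes per entry
# (the approximation quoted in this FAQ). ddt_incore_bytes is a
# hypothetical helper name, not a ZFS command.
DDT_ENTRY_BYTES=320

ddt_incore_bytes() {
    # $1 = number of allocated blocks, taken from zdb -S output
    echo $(( $1 * DDT_ENTRY_BYTES ))
}

# Example from step A: 3.75M allocated blocks
bytes=$(ddt_incore_bytes 3750000)
echo "in-core DDT size: $bytes bytes ($(( bytes / 1000000 )) MB)"
```

For the 3.75M-block example, this reports 1200 MB, which matches the figure in step A.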

What are the performance impacts of enabling dedup?

In general, dedup performance is optimal when the deduplication table fits into memory. If the dedup table has to be written to disk, then performance will decrease. For example, removing a large file system with dedup enabled will severely decrease system performance if the system doesn't meet the memory requirements described above.

Use zdb -DD to display the size of the DDT. This command must be run on a quiet pool.

# zdb -DD pool-name

The DDT is considered metadata. Up to 25% of memory (zfs_arc_meta_limit) can be used to store metadata. To monitor the size of the ZFS memory cache in bytes, use the following command:

# kstat zfs::arcstats:size
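
As a rough check of whether an estimated DDT fits within the metadata share of memory described above, the comparison can be sketched as follows. The helper name ddt_fits_in_meta_limit and the 8 GB sample memory size are hypothetical; the 25% figure is the limit quoted above and might differ on a system where zfs_arc_meta_limit has been tuned.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a DDT of $1 bytes fits within 25% of
# $2 bytes of physical memory (the metadata limit quoted in this FAQ).
ddt_fits_in_meta_limit() {
    ddt_bytes=$1
    mem_bytes=$2
    meta_limit=$(( mem_bytes / 4 ))
    [ "$ddt_bytes" -le "$meta_limit" ]
}

# Example: a 1200 MB DDT on a system assumed to have 8 GB of memory
if ddt_fits_in_meta_limit 1200000000 8589934592; then
    echo "DDT fits within the metadata limit"
else
    echo "DDT exceeds the metadata limit"
fi
```

This only compares the estimate against the limit. Other metadata also competes for the same share, so treat a narrow pass as a warning rather than a guarantee.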

See Roch's blog, which describes factors that might impact deduplication performance.

How do I enable dedup on my ZFS file systems?

The dedup property can be enabled on a ZFS file system by using the following syntax:

# zfs set dedup=on export

Enabling the dedup property on an existing file system means that all newly written data is deduplicated. Existing file system data remains duplicated.

Deduplication has a pool-wide scope, so a read-only pool property, dedupratio, is provided to determine the deduplication ratio realized for your file systems. For example:

# zpool list
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
export   928G  47.5G   881G     5%  1.77x  ONLINE  -
rpool    928G  25.7G   902G     2%  1.40x  ONLINE  - 

A DEDUP ratio of 1.00x generally means that the dedup property is disabled or that it was only recently enabled. As file system deduplication occurs, the DEDUP ratio will generally increase over time.
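
If you want to track the ratio over time, the DEDUP column can be pulled out of the zpool list output with a short filter. The following is a minimal sketch: dedup_ratios is a hypothetical helper name, and it assumes the header row shown in the example above, locating the column by the DEDUP header name.

```shell
#!/bin/sh
# Hypothetical helper: print "pool ratio" pairs from zpool list output
# read on stdin. The DEDUP column is located by its header name, so the
# header row shown in this FAQ is assumed to be present.
dedup_ratios() {
    awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "DEDUP") col = i; next }
         col     { print $1, $col }'
}

# Sample input copied from the example output in this FAQ
dedup_ratios <<'EOF'
NAME     SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
export   928G  47.5G   881G     5%  1.77x  ONLINE  -
rpool    928G  25.7G   902G     2%  1.40x  ONLINE  -
EOF
```

On a live system, the same filter could be fed from zpool list directly (zpool list | dedup_ratios). The property can also be read for a single pool with zpool get dedupratio export.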

The zpool list output has changed in this Solaris release. These changes are described in Why has the zpool command changed? 

How do I send deduplicated data?

To send a deduplicated send stream, you must use the zfs send -D syntax, even if the data is already deduped on disk. You can also use the zfs send -D syntax to send a deduplicated send stream of ZFS data that is not deduped on disk.

What is the dedup checksum?

The default deduplication checksum is sha256. The following two commands are equivalent:

# zfs set dedup=on export
# zfs set dedup=sha256 export

After the dedup property is enabled on a ZFS file system, the default file system checksum is sha256 for newly created files. Any previously set file system checksum property value, such as the default checksum of fletcher4, is overridden by the dedup property checksum.

Can I verify deduplicated hash comparisons?

You can ask ZFS to verify the SHA256 hash comparisons of blocks to be deduplicated as described in Jeff's blog by using this syntax:

# zfs set dedup=verify export

However, ZFS uses its own copy of SHA256 and doesn't currently use a crypto accelerator or crypto framework.

How does the dedup property interact with the copies property?

A block with copies set to N will always have at least N copies on the system regardless of the number of deduplicated references. 

You can use the dedupditto property to specify a threshold, and if the reference count for a deduped block goes above the threshold, another ditto copy of the block is stored automatically.

How does ZFS dedup space accounting work?

Deduplicated space accounting is reported at the pool level. You must use the zpool list command rather than the zfs list command to identify disk space consumption when dedup is enabled. If you use the zfs list command to review deduplicated space, the file system might appear to grow because more data can be stored on the same physical device. The zpool list command shows how much physical space is being consumed, along with the dedup ratio.

The df command is not dedup-aware and will not provide accurate space accounting.
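
To see why the pool-level and file-system-level views diverge, a back-of-the-envelope conversion from physical allocation to logical data can be sketched as follows. It assumes, purely for illustration, that all data in the pool was written with dedup enabled, and it ignores compression and metadata overhead. logical_usage is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: approximate logical data size as
# physical allocation x dedup ratio.
# $1 = ALLOC value in GB, $2 = dedup ratio, both as shown by zpool list
logical_usage() {
    awk -v alloc="$1" -v ratio="$2" 'BEGIN { printf "%.1f\n", alloc * ratio }'
}

# Values for the export pool in the zpool list example: 47.5G allocated, 1.77x
echo "approximate logical data: $(logical_usage 47.5 1.77) GB"
```

Under those assumptions, about 84 GB of file data occupies 47.5 GB of physical space, which is why the per-file-system numbers can exceed what the pool has actually allocated.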

Created by Cindy Swearingen on 2009/12/03 21:43
Last modified by Cindy Swearingen on 2011/09/22 22:00
