June 19, 2017

Guest post contributed by Allan Jude

ZFS was designed to be the last filesystem. Unlike other filesystems, which were designed to be adequate for the next 10 years, ZFS was designed to last. Instead of being built with limits meant to outlast the next decade, ZFS was built on scalable data structures that avoid limits not only on how big your volumes or disks can be, but also on how many filesystems, volumes, and disks you can have. A single ZFS pool can theoretically address 256 quadrillion zettabytes of data, spread across up to 16,384 quadrillion disks. These numbers sound absurd, but the benefit of this design is that ZFS doesn’t lose significant performance whether you have hundreds, or hundreds of thousands, of snapshots. The scalable design means ZFS can be as big as you need it to be: there are no practical limits, and no point where performance suddenly falls off a cliff.

What is ZFS?

ZFS is a copy-on-write filesystem, in contrast to traditional filesystems that overwrite data in place. When you make a change to a file, the changed block is written to a new location instead of over top of the original version. This enables two of ZFS’s biggest features: 1) the filesystem is always consistent, so there is never a need for a scan after a power failure or other disruption, and 2) the operator can create instant snapshots of each filesystem. If no snapshots exist, then once the new version of the file is written and the metadata that attaches the filename to that block of data is updated, the old version of the data becomes free space, available to be used by another file. The power comes when the operator has created a snapshot: in that case, the copies of any blocks that were in use when the snapshot was taken are kept, even when their content is superseded by newer data. Now you have access to both versions of the file: the live version, and the copy of the file exactly as it existed at the moment of the snapshot. The data that has not changed is shared between both copies, so the second copy only consumes space for the blocks that differ.
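For example, taking and browsing a snapshot is instant. Here is a minimal sketch, assuming a hypothetical dataset named mypool/home:

# zfs snapshot mypool/home@monday
# ls /mypool/home/.zfs/snapshot/monday

The hidden .zfs/snapshot directory exposes a read-only copy of the filesystem as it existed at the moment of each snapshot, and zfs rollback can return the live filesystem to that point.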

The primary design consideration for ZFS was the safety of your data. Every block that is written to the filesystem is accompanied by a checksum of the data, stored with the other metadata. That metadata block also has a checksum, as does its parent, all the way up to the top-level block, called the uber block. When the ZFS filesystem is mounted, it examines the available array of uber blocks and selects the newest one with a valid checksum. Combined with the copy-on-write feature, this means that in the event of a power failure or system crash, ZFS will still have a consistent view of the filesystem: any operations that were in progress are rolled back, and the filesystem is in pristine shape. This means no need for a long filesystem check after an unexpected shutdown.
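If you are curious, you can inspect the current uber block with the zdb debugging tool (shown here against the testpool created later in this article):

# zdb -u testpool

The output includes the uber block’s transaction group number and timestamp, which is how ZFS identifies the newest valid copy at import time.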

Every time a block of a file is read from a ZFS filesystem, the data returned by the disk is checksummed, and that checksum is verified against the one stored in the metadata. If the results differ, the disk has returned incorrect data. ZFS detects this and keeps a count of such errors for each disk, as they may be a sign of impending disk failure. If your ZFS pool is configured with redundancy, like mirrors or RAID-Z, the redundant copy or parity information will be used to reconstruct the incorrect block and write the repaired data back to the disk. If there is no redundancy and the data cannot be recovered, an error is returned. This allows the operating system to stop an application from using invalid data, which might cause it to crash or do the wrong thing.
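That redundancy is chosen when the pool is created. For instance, a simple two-way mirror looks like this (a sketch only; da0 and da1 are hypothetical device names):

# zpool create mypool mirror da0 da1

With a mirror, a block that fails its checksum on one disk is re-read from the other, the bad copy is rewritten, and the error is counted in zpool status.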

If even a single byte of data is wrong, it can have a major impact. Consider a group photo from a FreeBSD Developers Summit: if the disk corrupts even part of a single byte, the entire photo can be ruined.

Now imagine what happens to a database, where a single flipped bit might give a regular user administrator permissions, or leave the database unable to handle the unexpectedly inconsistent data, causing it to crash.

Truly, the most frightening case is when the corruption is not noticed. Years later, you realize that you have been backing up the corrupted photo, you have discarded your older backups, and now you have no copy of the photo from before the corruption. Traditional filesystems have no way of detecting such corruption, let alone correcting it. ZFS will sound the alarm, and if configured with redundancy, solve the problem seamlessly.

While every block is checked as each file is accessed, what about files you do not access frequently, or ever? ZFS has an operation called a “scrub” that reads every block in the pool and verifies it against its checksum. Scheduling this operation periodically ensures that any problems are detected and corrected before it is too late.
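On FreeBSD, periodic scrubs can be enabled through the stock periodic(8) framework, for example:

# sysrc -f /etc/periodic.conf daily_scrub_zfs_enable="YES"

With this setting, the nightly periodic run starts a scrub of each pool once the default threshold of 35 days has passed since the last one.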

To really understand the power of ZFS, you need to try it out. Luckily, this doesn’t require you to go out and buy a bunch of disks; you can do it all virtually:

Create a test pool

Before switching your disks over to ZFS, you can experiment a little to learn the basics. Download and start a FreeBSD VM. ZFS can create a storage pool out of any block device, or, for testing purposes, regular files.

Create three regular 2 gigabyte files:

# truncate -s 2g /tmp/file1
# truncate -s 2g /tmp/file2
# truncate -s 2g /tmp/file3

Create a new RAID-Z1 storage pool using these files:

# zpool create testpool raidz1 /tmp/file1 /tmp/file2 /tmp/file3

Examine the pool:

# zpool status testpool
  pool: testpool
 state: ONLINE
  scan: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        testpool        ONLINE       0     0     0
          raidz1-0      ONLINE       0     0     0
            /tmp/file1  ONLINE       0     0     0
            /tmp/file2  ONLINE       0     0     0
            /tmp/file3  ONLINE       0     0     0

errors: No known data errors
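You can also check the pool’s capacity. With three 2 gigabyte vdevs in RAID-Z1, roughly one disk’s worth of space is consumed by parity:

# zpool list testpool
# zfs list testpool

zpool list reports the raw size of the pool, while zfs list reports the space actually usable after parity.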

Test the redundancy

Write a file to the new pool:

# jot -b 'this is a test' 1000 > /testpool/first_file.txt
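To make the repair easy to verify later, record a checksum of the file now, using the sha256(1) utility from the FreeBSD base system:

# sha256 /testpool/first_file.txt

After the pool has been damaged and scrubbed below, running the same command again should produce the identical hash.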

Export (unmount) the pool:

# zpool export testpool

Now purposely corrupt one of the virtual disks, overwriting most of it with zeros (the first and last megabyte are left intact so that the ZFS vdev labels survive and the pool can still be imported):

# dd if=/dev/zero of=/tmp/file2 bs=1m seek=1 count=2046 conv=notrunc

Reimport the pool:

# zpool import -d /tmp testpool

Check the status of the pool:

# zpool status testpool
  pool: testpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0h0m with 0 errors on Fri May 26 00:27:51 2017
config:

        NAME            STATE     READ WRITE CKSUM
        testpool        ONLINE       0     0     0
          raidz1-0      ONLINE       0     0     0
            /tmp/file1  ONLINE       0     0     0
            /tmp/file2  ONLINE       0     0     5
            /tmp/file3  ONLINE       0     0     0

errors: No known data errors

ZFS verifies the checksum of every block as it is read. If the checksum does not match, ZFS knows that the disk has returned incorrect data, and attempts to recover the correct data from parity, mirrors, or additional copies of the data that may exist.

In the output of the zpool status command above, you can see that 5 blocks returned data with the wrong checksum when they were read.
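The “additional copies” mentioned above refer to the copies dataset property, which tells ZFS to store each block more than once even on a pool with no mirrors or RAID-Z. A minimal sketch (note that this only protects data written after the property is set):

# zfs set copies=2 testpool

This doubles the space consumed by new data, so it is usually a last resort for single-disk systems rather than a substitute for real redundancy.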

When storing a large number of files, there may be some files that are rarely or never read. To ensure these files remain intact, ZFS has an operation called a scrub that can be run periodically to verify all data in the pool.

Run a scrub on the pool:

# zpool scrub testpool

Monitor the status of the scrub operation:

# zpool status testpool
  pool: testpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 33.5K in 0h0m with 0 errors on Fri May 26 00:38:21 2017
config:

        NAME            STATE     READ WRITE CKSUM
        testpool        ONLINE       0     0     0
          raidz1-0      ONLINE       0     0     0
            /tmp/file1  ONLINE       0     0     0
            /tmp/file2  ONLINE       0     0    31
            /tmp/file3  ONLINE       0     0     0

errors: No known data errors

ZFS detected, and corrected, additional errors on the virtual disk that was overwritten with zeros.
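With the scrub complete and the damaged blocks rewritten, the error counters can be reset, as the status output suggests:

# zpool clear testpool

A subsequent zpool status will show the CKSUM column back at zero. On real hardware, errors that keep reappearing on the same disk are a strong hint that it should be swapped out with zpool replace.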

ZFS provides protection against disk failures, similar to the protection provided by other forms of RAID. However, ZFS also provides protection against data errors, such as flipped bits, a disk returning incorrect data, or a controller writing data to the wrong place on the disk.
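When you are finished experimenting, the test pool and its backing files can be removed:

# zpool destroy testpool
# rm /tmp/file1 /tmp/file2 /tmp/file3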

Conclusion

ZFS is the premier filesystem for data security, scalability, and flexibility. It can grow to meet whatever your needs might be, and allows you to partition your data as your needs and workloads change, without imposing complexity or rigidity. Adding additional storage is as easy as plugging in the additional disks. Data can be stored with up to triple redundancy, through three-way mirrors or triple-parity RAID-Z, to ensure that individual component failures will not interrupt your workload or damage your data.
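Growing a pool is a single command. For example, a second RAID-Z1 vdev could be added to the test pool from earlier (a sketch; file4 through file6 would first be created with truncate as before):

# zpool add testpool raidz1 /tmp/file4 /tmp/file5 /tmp/file6

The new space becomes available immediately, with no resizing or reformatting step.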
If you want to learn more about ZFS on FreeBSD, look for “FreeBSD Mastery: ZFS” and “FreeBSD Mastery: Advanced ZFS” at your favourite bookstore, ebook retailer, or visit ZFSBook.com.