
Re: zfs update



It took you a week to move 2.3T of data?  That sounds very wrong.  I have a similar setup here (10 x 4TB in RAIDZ2) and migrated my 9TB+ of data in just a couple days across GigE.



On Sat, Apr 19, 2014 at 8:51 PM, Robert G. (Doc) Savage <dsavage@peaknet.net> wrote:
When I created the ZFS pool in my storage pod a few weeks ago, I didn't know about zpool's ashift=12 option, which changes the default 512-byte sector size to 4096 bytes to match newer 4K-sector drives.
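As a quick sanity check, zdb should be able to report the ashift a pool actually got (assuming, as below, that the pool is named pod and zdb can read its cached config):
# zdb -C pod | grep ashift
That should show ashift: 12 for 4096-byte sectors, or ashift: 9 for the old 512-byte default.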

And I really should have matched the raidz parity level to the maximum number of drives I'll eventually have in the array. I can grow the pool by adding drives, but once the raidz level is set it cannot be changed. The following are conservative recommendations, not mandates:
Max # drives   Setting    Equivalent   Array capacity (4TB drives)
3-5            raidz      RAID5         7.2T - 14.4T
6-10           raidz2     RAID6        14.4T - 28.8T
11-15          raidz3     RAID7        28.8T - 43.2T
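As a rough rule, usable capacity is (number of drives minus parity drives) times the ~3.6T a 4T drive formats out to; ten drives in raidz2, for example, give (10 - 2) x 3.6T = 28.8T.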
I have 2.3T of data on my ZFS array, which is still comfortably below the 3.6T ext4-formatted capacity of a single 4T drive. I rsync'd everything from the array (2,230,424 files) to that drive. Even with gigabit Ethernet that took a week -- I *really* need 10GE. When everything was safely backed up, I blew away the old array and created a new one:
# zfs umount pod
# zpool destroy -f pod
# zpool create -f -o ashift=12 pod raidz3 \
    /dev/disk/by-id/scsi-SATA_ST4000NC000-1CD_Z3001HA2 \
    /dev/disk/by-id/scsi-SATA_ST4000DM000-1F2_W30063RJ \
    /dev/disk/by-id/scsi-SATA_ST4000DM000-1F2_Z300JAJ0 \
    /dev/disk/by-id/scsi-SATA_ST4000DM000-1F2_Z300NLT6 \
    /dev/disk/by-id/scsi-SATA_ST4000DM000-1F2_Z300J9LA \
    /dev/disk/by-id/scsi-SATA_ST4000DM000-1F2_W300JW66 \
    /dev/disk/by-id/scsi-SATA_ST4000DM000-1F2_W300H3PX \
    /dev/disk/by-id/scsi-SATA_ST4000DM000-1F2_W300H59J \
    /dev/disk/by-id/scsi-SATA_ST4000DM000-1F2_W300F5NR \
    /dev/disk/by-id/scsi-SATA_ST4000DM000-1F2_W300KKZ3
(The backslashes just continue one long zpool create command across several lines for readability.)
# zfs set sharenfs=on pod
# zfs mount pod
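Before copying anything back, a quick look confirms the layout and the NFS share came out as intended (assuming sharenfs was set as above):
# zpool status pod
# zfs get sharenfs pod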
The whole destroy-and-recreate takes only as long as it takes to type; there's no separate mkfs-style filesystem creation step with ZFS. But it took another week to rsync everything back, and at one point the load average hit 8+ while a 40GB virtual machine image was transferring.
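For reference, the copy back was just rsync in the other direction; something like this (backuphost and /mnt/backup are placeholders for wherever the backup drive lives) preserves hard links and sparse files:
# rsync -aHSv --progress backuphost:/mnt/backup/ /pod/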

The new raidz3 array is not yet optimal. Its size precludes the use of ZFS's native deduplication. There's no way I could cram the required 400GB of RAM on the pod's ATX motherboard. To dedupe the array I need to run Steve's linkdups utility:
# cd /pod
# linkdups -r -v
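As an aside, zdb can simulate what native dedup would have saved without actually enabling it; the run is read-only but can take quite a while on a full pool:
# zdb -S pod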
While linkdups minimizes the space occupied by duplicate files, a side effect messes with the array's parity checksums, which shows up in zpool status:
# zpool status
  pool: pod
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
config:

   NAME                                    STATE     READ WRITE CKSUM
   pod                                     ONLINE       0     0     0
     raidz3-0                              ONLINE       0     0     0
       scsi-SATA_ST4000NC000-1CD_Z3001HA2  ONLINE       0     0    26
       scsi-SATA_ST4000DM000-1F2_W30063RJ  ONLINE       0     0    25
       scsi-SATA_ST4000DM000-1F2_Z300JAJ0  ONLINE       0     0    21
       scsi-SATA_ST4000DM000-1F2_Z300NLT6  ONLINE       0     0    14
       scsi-SATA_ST4000DM000-1F2_Z300J9LA  ONLINE       0     0    30
       scsi-SATA_ST4000DM000-1F2_W300JW66  ONLINE       0     0    13
       scsi-SATA_ST4000DM000-1F2_W300H3PX  ONLINE       0     0     4
       scsi-SATA_ST4000DM000-1F2_W300H59J  ONLINE       0     0    10
       scsi-SATA_ST4000DM000-1F2_W300F5NR  ONLINE       0     0    12
       scsi-SATA_ST4000DM000-1F2_W300KKZ3  ONLINE       0     0    23

errors: No known data errors
These are not faults in the individual drives, and they can be corrected by having ZFS re-verify the checksums with a scrub:
# zpool scrub pod
This process runs invisibly in the background. To check its progress, run zpool status again:
# zpool status
  pool: pod
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub in progress since Sat Apr 19 06:59:26 2014
    48.0G scanned out of 3.25T at 11.5M/s, 81h20m to go
    0 repaired, 1.44% done
config:

    NAME                                    STATE     READ WRITE CKSUM
    pod                                     ONLINE       0     0     0
      raidz3-0                              ONLINE       0     0     0
        scsi-SATA_ST4000NC000-1CD_Z3001HA2  ONLINE       0     0    26
        scsi-SATA_ST4000DM000-1F2_W30063RJ  ONLINE       0     0    25
        scsi-SATA_ST4000DM000-1F2_Z300JAJ0  ONLINE       0     0    21
        scsi-SATA_ST4000DM000-1F2_Z300NLT6  ONLINE       0     0    14
        scsi-SATA_ST4000DM000-1F2_Z300J9LA  ONLINE       0     0    30
        scsi-SATA_ST4000DM000-1F2_W300JW66  ONLINE       0     0    13
        scsi-SATA_ST4000DM000-1F2_W300H3PX  ONLINE       0     0     4
        scsi-SATA_ST4000DM000-1F2_W300H59J  ONLINE       0     0    10
        scsi-SATA_ST4000DM000-1F2_W300F5NR  ONLINE       0     0    12
        scsi-SATA_ST4000DM000-1F2_W300KKZ3  ONLINE       0     0    23

errors: No known data errors
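To keep an eye on it without retyping, something like watch does the job, re-running the status command once a minute:
# watch -n 60 zpool status pod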
When the scan/scrub is finished (it won't really take 81 hours), I'll re-zero the CKSUM column with:
# zpool clear pod
When all that's done, I can take that extra 4T drive and add it to the pool:
# zfs umount pod
# zpool add -f pod /dev/disk/by-id/scsi-SATA_ST4000DM000-1F2_Z300J9G8
# zfs mount pod
The result will be a larger pool, though not quite the same as if I'd built the raidz3 with eleven drives from the beginning: zpool add brings the new disk in as a separate top-level vdev striped alongside the raidz3 set, rather than widening the raidz3 itself.
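Once it's added, zpool status pod should show the new disk as its own top-level vdev under the pool, and zpool list pod should reflect the extra ~3.6T:
# zpool status pod
# zpool list pod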

--Doc