[RFC 0/3] Btrfs checksum offload
Mark Harmstone
maharmstone at meta.com
Wed Jan 29 07:55:33 PST 2025
On 29/1/25 14:02, Kanchan Joshi wrote:
>
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> SSD for data checksumming.
>
> Now, the longer version for why/how.
>
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.
>
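For anyone not familiar with what "generates the checksum (and reftag)" amounts to, here is a rough userspace sketch of an 8-byte Type 1 PI tuple: a CRC-16/T10-DIF guard tag, an application tag, and a reference tag, all big-endian. The function and struct names are made up for illustration and this is not the kernel's code; the application tag is simply left at zero here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-16/T10-DIF: polynomial 0x8BB7, initial value 0, no reflection. */
uint16_t crc_t10dif(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(buf[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x8BB7);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* 8 bytes of PI per protection interval, stored big-endian (hypothetical type). */
struct pi_tuple {
    uint8_t bytes[8];
};

/* Build the tuple for one data block; ref_tag is typically the low 32 bits of the LBA. */
struct pi_tuple pi_generate(const uint8_t *block, size_t len, uint32_t ref_tag)
{
    struct pi_tuple t;
    uint16_t guard = crc_t10dif(block, len);

    t.bytes[0] = (uint8_t)(guard >> 8);    /* guard tag */
    t.bytes[1] = (uint8_t)guard;
    t.bytes[2] = 0;                        /* application tag, unused in this sketch */
    t.bytes[3] = 0;
    t.bytes[4] = (uint8_t)(ref_tag >> 24); /* reference tag */
    t.bytes[5] = (uint8_t)(ref_tag >> 16);
    t.bytes[6] = (uint8_t)(ref_tag >> 8);
    t.bytes[7] = (uint8_t)ref_tag;
    return t;
}

/* Read-side check: recompute the tuple and compare. Returns 1 on match, 0 otherwise. */
int pi_verify(const uint8_t *block, size_t len, uint32_t ref_tag,
              const struct pi_tuple *stored)
{
    struct pi_tuple expect = pi_generate(block, len, ref_tag);

    return memcmp(expect.bytes, stored->bytes, sizeof(expect.bytes)) == 0;
}
```

The point is that this per-block work already happens below the filesystem on such devices, independently of the Btrfs csum tree.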
> Btrfs has its own data and metadata checksumming, which is currently
> disconnected from the above.
> It maintains a separate on-device 'checksum tree' for data checksums,
> while the block layer will also be checksumming each Btrfs I/O.
>
> There is value in avoiding the copy-on-write (COW) checksum tree on
> a device that can store checksums inline anyway (as part of PI).
> This would eliminate extra checksum writes/reads, making I/O
> more CPU-efficient.
> Additionally, usable space would increase, and write
> amplification, both in Btrfs and eventually at the device level, would
> be reduced [*].
>
> NVMe drives can also automatically insert and strip the PI/checksum
> and provide a per-I/O control knob (the PRACT bit) for this.
> The block layer currently makes no attempt to detect or advertise this offload.
>
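To make the PRACT knob concrete, here is a small sketch of how the protection-information bits of a read/write command might be composed. The bit positions are the ones I recall from the PRINFO flags in Linux's include/linux/nvme.h (treat them as an assumption and check the header); the enum names and sketch_rw_control() are hypothetical and not part of this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* PRINFO bits in the 16-bit control field of an NVMe read/write command
 * (assumed to match the NVME_RW_PRINFO_* values in include/linux/nvme.h). */
enum {
    SKETCH_PRINFO_PRCHK_REF   = 1 << 10, /* device checks the reference tag */
    SKETCH_PRINFO_PRCHK_APP   = 1 << 11, /* device checks the application tag */
    SKETCH_PRINFO_PRCHK_GUARD = 1 << 12, /* device checks the guard (CRC) */
    SKETCH_PRINFO_PRACT       = 1 << 13, /* device inserts/strips the PI itself */
};

/*
 * Hypothetical helper: pick the control bits for a Type 1 namespace.
 * With offload, the host sends no metadata buffer and sets PRACT so the
 * device generates PI on write and verifies/strips it on read.
 */
uint16_t sketch_rw_control(bool offload_to_device)
{
    uint16_t control = SKETCH_PRINFO_PRCHK_GUARD | SKETCH_PRINFO_PRCHK_REF;

    if (offload_to_device)
        control |= SKETCH_PRINFO_PRACT;
    return control;
}
```

Because the bit is per command, the offload can in principle be chosen per I/O rather than per namespace.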
> This patch series: (a) adds checksum offload awareness to the
> block layer (patch #1),
> (b) enables the NVMe driver to register and support the offload
> (patch #2), and
> (c) introduces an opt-in (datasum_offload mount option) in Btrfs to
> apply checksum offload for data (patch #3).
>
> [*] Here are some perf/write-amplification numbers from a randwrite test [1]
> on 3 configs (same device):
> Config 1: No meta format (4K) + Btrfs (base)
> Config 2: Meta format (4K + 8b) + Btrfs (base)
> Config 3: Meta format (4K + 8b) + Btrfs (datasum_offload)
>
> In configs 1 and 2, Btrfs operates with a checksum tree.
> Only in config 2 does the block layer attach an integrity buffer to each I/O
> and do checksum/reftag verification.
> Only in config 3 does the offload take place, with the device generating and
> verifying the checksum.
>
> AppW: writes issued by app, 120G (4 Jobs, each writing 30G)
> FsW: writes issued to device (from iostat)
> ExtraW: extra writes compared to AppW
>
> Direct I/O
> ---------------------------------------------------------
> Config    IOPS(K)    FsW(G)    ExtraW(G)
> 1         144        186       66
> 2         141        181       61
> 3         172        129       9
>
> Buffered I/O
> ---------------------------------------------------------
> Config    IOPS(K)    FsW(G)    ExtraW(G)
> 1         82         255       135
> 2         80         252       132
> 3         100        199       79
>
> Write amplification is generally high (which is understandable given
> B-trees), but I am not sure why buffered I/O shows that much.
>
> [1] fio --name=btrfswrite --ioengine=io_uring --directory=/mnt --blocksize=4k --readwrite=randwrite --filesize=30G --numjobs=4 --iodepth=32 --randseed=0 --direct=1 --output=out --group_reporting
>
>
> Kanchan Joshi (3):
> block: add integrity offload
> nvme: support integrity offload
> btrfs: add checksum offload
>
> block/bio-integrity.c | 42 ++++++++++++++++++++++++++++++++++++++-
> block/t10-pi.c | 7 +++++++
> drivers/nvme/host/core.c | 24 ++++++++++++++++++++++
> drivers/nvme/host/nvme.h | 1 +
> fs/btrfs/bio.c | 12 +++++++++++
> fs/btrfs/fs.h | 1 +
> fs/btrfs/super.c | 9 +++++++++
> include/linux/blk_types.h | 3 +++
> include/linux/blkdev.h | 7 +++++++
> 9 files changed, 105 insertions(+), 1 deletion(-)
>
There's also checksumming done on the metadata trees, which could be
avoided if we're trusting the block device to do it.
Maybe rather than putting this behind a new compat flag, add a new csum
type of "none"? With the logic being that it also zeroes out the csum
field in the B-tree headers.
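Roughly what I have in mind, as a standalone sketch rather than a patch: the values 0 to 3 mirror the existing csum types and their sizes, while the "none" value (4 is an arbitrary pick), the sketch_ names, and the helper functions are all hypothetical. Only crc32c and "none" are filled in, and it assumes the csum field is the first 32 bytes of each tree block.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_CSUM_FIELD_SIZE 32 /* csum field at the start of a tree block header */

enum sketch_csum_type {
    SKETCH_CSUM_CRC32  = 0,
    SKETCH_CSUM_XXHASH = 1,
    SKETCH_CSUM_SHA256 = 2,
    SKETCH_CSUM_BLAKE2 = 3,
    SKETCH_CSUM_NONE   = 4, /* hypothetical: integrity left to the block device */
};

/* Checksum bytes stored per block; zero means nothing goes into a csum tree. */
size_t sketch_csum_size(enum sketch_csum_type type)
{
    switch (type) {
    case SKETCH_CSUM_CRC32:  return 4;
    case SKETCH_CSUM_XXHASH: return 8;
    case SKETCH_CSUM_SHA256: return 32;
    case SKETCH_CSUM_BLAKE2: return 32;
    case SKETCH_CSUM_NONE:   return 0;
    }
    return 0;
}

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t sketch_crc32c(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
    }
    return ~crc;
}

/*
 * Fill the csum field of a metadata block. For "none" the field is simply
 * zeroed. Returns 0 on success, -1 for types this sketch does not implement.
 */
int sketch_csum_tree_block(enum sketch_csum_type type, uint8_t *block, size_t nodesize)
{
    memset(block, 0, SKETCH_CSUM_FIELD_SIZE);

    switch (type) {
    case SKETCH_CSUM_NONE:
        return 0;
    case SKETCH_CSUM_CRC32: {
        uint32_t c = sketch_crc32c(block + SKETCH_CSUM_FIELD_SIZE,
                                   nodesize - SKETCH_CSUM_FIELD_SIZE);
        block[0] = (uint8_t)c;         /* stored little-endian */
        block[1] = (uint8_t)(c >> 8);
        block[2] = (uint8_t)(c >> 16);
        block[3] = (uint8_t)(c >> 24);
        return 0;
    }
    default:
        return -1;
    }
}
```

With a zero-sized csum type, the data csum tree would have nothing to store and the metadata path would skip hashing, which covers both cases with one on-disk switch.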
Mark