[Lsf-pc] [LSF/MM/BPF ATTEND][LSF/MM/BPF TOPIC] Meta/Integrity/PI improvements

Kent Overstreet kent.overstreet at linux.dev
Thu Apr 4 23:12:01 PDT 2024


On Mon, Feb 26, 2024 at 06:15:19PM -0500, Martin K. Petersen wrote:
> 
> Kanchan,
> 
> > - Generic user interface that user-space can use to exchange meta. A
> > new io_uring opcode IORING_OP_READ/WRITE_META - seems feasible for
> > direct IO.
> 
> Yep. I'm interested in this too. Reviving this effort is near the top of
> my todo list so I'm happy to collaborate.
> 
> > NVMe SSD can do the offload when the host sends the PRACT bit. But in
> > the driver, this is tied to global integrity disablement using
> > CONFIG_BLK_DEV_INTEGRITY.
> 
> > So, the idea is to introduce a bio flag REQ_INTEGRITY_OFFLOAD
> > that the filesystem can send. The block-integrity and NVMe driver do
> > the rest to make the offload work.
> 
> Whether to have a block device do this is currently controlled by the
> /sys/block/foo/integrity/{read_verify,write_generate} knobs. At least
> for SCSI, protected transfers are always enabled between HBA and target
> if both support it. If no integrity has been attached to an I/O by the
> application/filesystem, the block layer will do so controlled by the
> sysfs knobs above. IOW, if the hardware is capable, protected transfers
> should always be enabled, at least from the block layer down.
> 
> It's possible that things don't work quite that way with NVMe since, at
> least for PCIe, the drive is both initiator and target. And NVMe also
> missed quite a few DIX details in its PI implementation. It's been a
> while since I messed with PI on NVMe, I'll have a look.
> 
> But in any case the intent for the Linux code was for protected
> transfers to be enabled automatically when possible. If the block layer
> protection is explicitly disabled, a filesystem can still trigger
> protected transfers via the bip flags. So that capability should
> definitely be exposed via io_uring.

I've little interest in checksum calculation offload - but protected
transfers are interesting.

bcachefs moves data around in the background (copygc, rebalance).
Whenever we move existing data we're careful to carry the existing
checksum along and revalidate it at every step, and when we have to
compute new checksums (because we're fragmenting an existing extent) we
check that they sum up to the old checksum.
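
As a rough illustration of that last check (a sketch only, not bcachefs
code): the example below assumes a CRC-style checksum, with zlib's CRC-32
standing in for whatever checksum type the filesystem actually uses. The
names crc32_combine() and fragment_extent() are hypothetical helpers made
up for this example.

```python
import zlib


def crc32_combine(crc_a: int, crc_b: int, len_b: int) -> int:
    """Return crc32(A + B) given only crc32(A), crc32(B) and len(B)."""
    zeros = bytes(len_b)
    # Feeding len_b zero bytes "shifts" crc_a past B. XORing with the CRC of
    # the zeros alone cancels the init/final-XOR terms of that shift.
    shifted_a = zlib.crc32(zeros, crc_a) ^ zlib.crc32(zeros)
    return shifted_a ^ crc_b


def fragment_extent(data: bytes, old_crc: int, split_at: int):
    """Split an extent in two and return [(fragment, crc), ...].

    Raises ValueError if the new per-fragment checksums do not combine
    back to the checksum that was carried along with the extent.
    """
    frags = [data[:split_at], data[split_at:]]
    crcs = [zlib.crc32(f) for f in frags]
    if crc32_combine(crcs[0], crcs[1], len(frags[1])) != old_crc:
        raise ValueError("checksum mismatch while fragmenting extent")
    return list(zip(frags, crcs))
```

The point of the combine step is that the old checksum is never simply
trusted or discarded: if the data was corrupted between the time the old
checksum was computed and the time the fragments were checksummed, the
combined value no longer matches and the split is rejected. This version
costs time proportional to the fragment length; zlib's C API offers
crc32_combine() for the same result without touching that many bytes.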

It'd be pretty cool to push this down into the storage device (and up
into the page cache as well).


