nvme-format: protection information enabled although metadata size is 0

Keith Busch kbusch at kernel.org
Wed Nov 2 12:47:56 PDT 2022


On Wed, Nov 02, 2022 at 08:32:19PM +0100, Binarus wrote:
> On 02.11.2022 16:59, Keith Busch wrote:
> > though, so I'll add that this particular model does not work with the
> > Linux kernel's end-to-end protection. This device supports only the
> > "extended" metadata, not the "separate" that the Linux block stack
> > requires. You won't be able to use the generic block layer for IO with
> > protection information, but you should be able to use it in passthrough
> > modes. And if you are using the 8-byte format (LBAF 4, I believe), then
> > the driver will have the device strip/generate PI without the host ever
> > seeing it.
> 
> I have a vague notion of the metadata types, and have noticed something
> that worries me even more:
> 
> In the datasheet / manual for the P3700 from October 2015 (the newest
> version I could find), table 34 on page 38, which describes the Identify
> Namespace data structure, clearly says that byte 27 will report the value
> 0x3, meaning that both metadata types (extended and separate) are
> supported. From the "Interpretation" column of the "MC" row:
> 
> "Indicated support for metadata transferred with the extended data LBA and
> in separate buffer - both are supported."
> 
> However, when I execute nvme id-ns /dev/nvme0n1 on the machine in question,
> it shows the value 0x1 for the MC, which means that it supports only the
> extended LBA metadata.
> 
> That means that either the datasheet / manual or nvme is wrong. I guess
> that the former is the case, and your statement supports that.

Your data sheet is wrong. This family of controllers never supported
anything but interleaved metadata.
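
For reference, MC is byte 27 of the Identify Namespace data structure:
bit 0 set means the controller can carry metadata interleaved in an
extended data LBA, bit 1 set means it can use a separate buffer. A
trivial decoder, fed the 0x1 your drive actually reports:

  #include <stdio.h>

  /* Decode the MC (Metadata Capabilities) byte from Identify Namespace.
   * Bit 0: metadata may be transferred in an extended data LBA.
   * Bit 1: metadata may be transferred in a separate buffer. */
  static void decode_mc(unsigned char mc)
  {
      printf("extended LBA metadata: %s\n", (mc & 0x1) ? "yes" : "no");
      printf("separate buffer:       %s\n", (mc & 0x2) ? "yes" : "no");
  }

  int main(void)
  {
      decode_mc(0x1);  /* the value nvme id-ns reports on this drive */
      return 0;
  }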
 
> I had absolutely no clue that the standard Linux IO does not support
> extended LBA metadata, and thus does not support extended LBA PI. That's
> quite disappointing.

How could it be supported? The format requires the data and metadata to
be virtually contiguous, but that's impossible when the user application
provides only the data.
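
To put a number on it: every block of an extended format carries its
metadata inline, so a transfer needs one buffer sized for both, which a
plain read(2)/write(2) buffer never is. Untested illustration:

  #include <stddef.h>

  /* With an extended LBA format each block on the wire is laid out as
   * [ data | metadata ] back to back, so transferring n blocks needs a
   * single virtually contiguous buffer of n * (lba_size + meta_size)
   * bytes. A normal read(2)/write(2) buffer only holds n * lba_size,
   * which is the mismatch described above. */
  static size_t extended_xfer_len(size_t nblocks, size_t lba_size,
                                  size_t meta_size)
  {
      return nblocks * (lba_size + meta_size);   /* e.g. 8 * (512 + 8) */
  }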

The only option would be for the kernel to bounce it through a new
buffer, and that's more horrible than it sounds, not to mention a
complete disaster for memory reclaim. This was my last attempt at it:

  http://lists.infradead.org/pipermail/linux-nvme/2018-February/015844.html

> Currently, I don't know what the passthrough mode you
> have mentioned is, but I'll research it.

From user space, you'd have to use the NVME_IOCTL_IO_CMD ioctl instead
of normal read/write.
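
Here's a rough, untested sketch of a single-block read on a 512+8
extended format; the device path, namespace ID, and starting LBA are
placeholders, and you'll need CAP_SYS_ADMIN to issue it:

  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/nvme_ioctl.h>

  int main(void)
  {
      unsigned lba_size = 512, meta_size = 8, nlb = 1;
      size_t len = (size_t)nlb * (lba_size + meta_size);
      struct nvme_passthru_cmd cmd;
      void *buf;

      int fd = open("/dev/nvme0n1", O_RDONLY);   /* placeholder device */
      if (fd < 0 || posix_memalign(&buf, 4096, len))
          return 1;

      memset(&cmd, 0, sizeof(cmd));
      cmd.opcode   = 0x02;       /* NVMe Read */
      cmd.nsid     = 1;          /* placeholder namespace ID */
      cmd.addr     = (uintptr_t)buf;
      cmd.data_len = len;        /* data plus interleaved metadata */
      cmd.cdw10    = 0;          /* starting LBA, low 32 bits */
      cmd.cdw11    = 0;          /* starting LBA, high 32 bits */
      cmd.cdw12    = nlb - 1;    /* 0's based block count; PRINFO left
                                    clear so PI passes through to us */

      /* No separate metadata pointer: with an extended LBA format the
       * 8-byte PI tuple sits directly after each block's data. */
      if (ioctl(fd, NVME_IOCTL_IO_CMD, &cmd) < 0)
          perror("NVME_IOCTL_IO_CMD");
      else
          printf("last %u bytes of buf are the PI tuple\n", meta_size);

      free(buf);
      close(fd);
      return 0;
  }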

> Perhaps I am using it already, because the SSD in question acts as a cache
> device in a ZFS pool. Since ZFS circumvents the normal I/O layer at some
> places, maybe it can use extended LBA PI.

Kernel space can also issue passthrough commands if it really wants to,
via REQ_OP_DRV_IN/OUT requests, but I seriously doubt that's happening.
That'd be quite fragile for an out-of-tree filesystem to attempt.


