nvme-host: disk corruptions when issuing IDENTIFY commands via ioctl()
Keith Busch
kbusch at kernel.org
Tue Mar 8 11:52:38 PST 2022
On Tue, Mar 08, 2022 at 05:45:20PM +0100, Maurizio Lombardi wrote:
> Hello,
>
> I recently received a bug report complaining about disk corruptions when
> issuing an NVME_IOCTL_ADMIN_CMD / IDENTIFY ioctl() with cmd.data_len =
> 8192 bytes and the buffer address not aligned to the page size.
>
> This is the C program that we used to reproduce the issue (tested with
> 5.17.0-rc6): http://bsdbackstore.it/misc/nvme_ioctl_512.c
>
> simply run it by passing a path to an nvme device:
> ./nvme_ioctl_512 /dev/nvme0n1
>
> It appears to be very unpredictable. Sometimes I hit disk corruptions
> after a few tries, sometimes it takes hours. Sometimes the ioctl()
> returns success and sometimes it fails.
>
> We suspect that the root cause is that the nvme-host driver doesn't
> enforce the 4096 byte limit for the IDENTIFY commands as the
> nvme-target does (see the nvmet_execute_identify() -->
> nvmet_check_transfer_len(req, NVME_IDENTIFY_DATA_SIZE) code).
> So if we pass an 8192-byte buffer not aligned to the page size, it will
> need 3 pages on archs where page size is 4k and the nvme spec says
> that the data buffer may not cross more than one page boundary.
>
> Does it make sense to you? What's your opinion on this?

You are telling the driver to prepare a 3-page PRP, so it makes a PRP
list. The device knows it's a 4k payload, though, so it thinks your PRP
list pointer is actually a pointer to the data destination. The device
is corrupting that memory, which could lead to on-disk corruption if
that memory is concurrently used for a data-out command. Observing that
type of corruption is probably not deterministic.

This was an unfortunate pitfall of nvme's PRP method: the transfer
length is implicit, so both sides need to agree on that for everything
to work. If either side is mistaken on the transfer length, then you get
corruption.

In short: don't do that. If your application misuses the ioctl to break
it, you get to keep both pieces.