nvme-host: disk corruptions when issuing IDENTIFY commands via ioctl()
Ming Lei
ming.lei at redhat.com
Tue Mar 8 16:18:47 PST 2022
On Tue, Mar 08, 2022 at 11:52:38AM -0800, Keith Busch wrote:
> On Tue, Mar 08, 2022 at 05:45:20PM +0100, Maurizio Lombardi wrote:
> > Hello,
> >
> > I recently received a bug report complaining about disk corruptions when
> > issuing a NVME_IOCTL_ADMIN_CMD / IDENTIFY ioctl() with cmd.data_len =
> > 8192 bytes and the buffer address not aligned to the page size.
> >
> > This is the C program that we used to reproduce the issue (tested with
> > 5.17.0-rc6): http://bsdbackstore.it/misc/nvme_ioctl_512.c
> >
> > simply run it by passing a path to an nvme device:
> > ./nvme_ioctl_512 /dev/nvme0n1
> >
> > It appears to be very unpredictable. Sometimes I hit disk corruptions
> > after a few tries, sometimes it takes hours. Sometimes the ioctl()
> > returns success and sometimes it fails.
> >
> > We suspect that the root cause is that the nvme-host driver doesn't
> > enforce the 4096 byte limit for the IDENTIFY commands as the
> > nvme-target does (see the nvmet_execute_identify() -->
> > nvmet_check_transfer_len(req, NVME_IDENTIFY_DATA_SIZE) code).
> > So if we pass a 8192-byte buffer not aligned to the page size, it will
> > need 3 pages on archs where page size is 4k and the nvme spec says
> > that the data buffer may not cross more than one page boundary.
> >
> > Does it make sense to you? What's your opinion on this?
>
> You are telling the driver to prepare a 3-page PRP, so it makes a PRP
> list. The device knows it's a 4k payload, though, so it thinks your PRP
> list pointer is actually a pointer to the data destination. The device
> is corrupting that memory, which could lead to on-disk corruption if
> that memory is concurrently used for a data-out command. Observing that
> type of corruption is probably not deterministic.
>
> This was an unfortunate pitfall of nvme's PRP method: the transfer
> length is implicit, so both sides need to agree on that for everything
> to work. If either side is mistaken on the transfer length, then you get
> corruption.
>
> In short: don't do that. If your application misuses the ioctl to break
> it, you get to keep both pieces.
Given NVMe spec states that data length of IDENTIFY command should be
4096bytes, and PRP list can't be used.
So looks nvme driver need to validate the command before submitting to
hardware, otherwise any buggy application can break FS or memory easily.
Thanks,
Ming
More information about the Linux-nvme
mailing list