[PATCH blktests] nvme: test log page offsets
Keith Busch
kbusch at kernel.org
Mon Feb 5 22:41:03 PST 2024
On Tue, Feb 06, 2024 at 06:02:24AM +0000, Shinichiro Kawasaki wrote:
> Hi Keith, thanks for the patch.
>
> On Feb 05, 2024 / 10:52, Keith Busch wrote:
> > From: Keith Busch <kbusch at kernel.org>
> >
> > I've encountered a device that fails catastrophically if the host
> > requests to an error log with a non-zero LPO. The fallout has been bad
> > enough to warrant a sanity check against this scenario.
>
> Question, which part of the kernel code does this test case cover? I'm wondering
> if this test case might be testing NVMe devices rather than the kernel code.
This is definitely a device-side focused test. This isn't really
exercising particularly interesting parts of the kernel/driver that are
not already thoroughly hit with other tests.
> Also, was there any related kernel code change or discussion? If so, I would
> like to leave links to them in the commit message.
Not a kernel change, but a tooling change. smartmontools, particularly
the minor update package from Redhat, version 7.1-3, introduced a change
to split large logs into multiple commands. The device doesn't just
break itself in this scenario: every non-posted PCIe transaction times
out after it sees an error log command with a non-zero offset that AER
handing fails to recover, taking servers offline with it. A truly
spectacular cascade of failure from a seemingly benign test case.
FWIW, while the device handling is completely wrong, this log has read
side effects, so an application reading non-zero offsets is pretty much
nonsense (and unfortunately nvme-cli does similiar things too). I've
filed a bug report on smartmontools' github. I believe someone else will
be taking this up with Redhat because the backports from 7.2 to the
minor 7.1-3 update are wrong anyway.
> I ran this test case on my test system using QEMU NVME device, and saw it failed
> with the message below.
>
> nvme/051 => nvme0n1 (Tests device support for log page offsets) [failed]
> runtime 0.104s ... 0.126s
> --- tests/nvme/051.out 2024-02-06 09:46:03.522522896 +0900
> +++ /home/shin/Blktests/blktests/results/nvme0n1/nvme/051.out.bad 2024-02-06 14:50:57.394105192 +0900
> @@ -1,2 +1,3 @@
> Running nvme/051
> +NVMe status: Invalid Field in Command: A reserved coded value or an unsupported value in a defined field(0x4002)
> Test complete
>
> I took a look in the latest QEMU code, and found it returns "Invalid Field in
> Command" when the specified offset is larger than error log size in QEMU. Do I
> miss anything to make this test case pass?
Ah, good catch. I don't want to fail the test if the device correctly
reports an error, so I'd need to redirect 2>&1. Success for this test is
just if the command completes at all with any status.
More information about the Linux-nvme
mailing list