[PATCH blktests] nvme: test log page offsets

Shinichiro Kawasaki shinichiro.kawasaki at wdc.com
Tue Feb 6 20:32:02 PST 2024


On Feb 05, 2024 / 22:41, Keith Busch wrote:
> On Tue, Feb 06, 2024 at 06:02:24AM +0000, Shinichiro Kawasaki wrote:
> > Hi Keith, thanks for the patch.
> > 
> > On Feb 05, 2024 / 10:52, Keith Busch wrote:
> > > From: Keith Busch <kbusch at kernel.org>
> > > 
> > > I've encountered a device that fails catastrophically if the host
> > > requests an error log read with a non-zero LPO. The fallout has been bad
> > > enough to warrant a sanity check against this scenario.
> > 
> > Question: which part of the kernel code does this test case cover? I'm
> > wondering if this test case might be testing NVMe devices rather than the
> > kernel code.
> 
> This is definitely a device-side focused test. This isn't really
> exercising particularly interesting parts of the kernel/driver that are
> not already thoroughly hit with other tests.
>  
> > Also, was there any related kernel code change or discussion? If so, I would
> > like to leave links to them in the commit message.
> 
> Not a kernel change, but a tooling change. smartmontools, particularly
> the minor update package from Red Hat, version 7.1-3, introduced a change
> to split large logs into multiple commands. The device doesn't just
> break itself in this scenario: once it sees an error log command with a
> non-zero offset, every non-posted PCIe transaction times out, AER
> handling fails to recover, and servers go offline with it. A truly
> spectacular cascade of failure from a seemingly benign test case.

Thanks for the explanations. The issue sounds nasty. IIUC, the motivation to
add this test case is to catch the device issue early and not waste
developers' precious time debugging it again.
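
For reference, my understanding is that the problematic access is an Error
Information log read with a non-zero Log Page Offset. Something like the
nvme-cli command below should exercise that path by hand (just an illustration
of the scenario, not the actual test body; the flags come from 'nvme get-log',
and /dev/nvme0 is a placeholder):

    # read 64 bytes of the Error Information log (log id 0x01),
    # starting at byte offset 64 instead of offset 0
    nvme get-log /dev/nvme0 --log-id=0x01 --log-len=64 --lpo=64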

Having said that, this will be the first test case that tests *devices*, and
it will extend the role of blktests. So far, all blktests test cases have been
intended to test kernel/driver *code*.

With this background, I have two questions in mind:

Q1) What is the expected action to take when blktests users observe a failure
    of this test case? Report it to linux-block or linux-nvme? Or report it to
    the device manufacturer?

Q2) When a new nvme driver patch needs to be checked before submission, is
    this test case worth running?

I guess the answer to Q1 is 'Report to the device manufacturer' and the answer
to Q2 is 'No'. It would be better to clarify these points with some more
description in the comment block of the test case.

Another idea is to move the test case out of the nvme group into a new group
called "nvmedev" or "devcompat", dedicated to checking that storage devices
work well with the Linux storage stack. That would make it easier for blktests
users to understand what action to take and which test group to run. What do
you think?


...

> 
> > I ran this test case on my test system using QEMU NVME device, and saw it failed
> > with the message below.
> > 
> > nvme/051 => nvme0n1 (Tests device support for log page offsets) [failed]
> >     runtime  0.104s  ...  0.126s
> >     --- tests/nvme/051.out      2024-02-06 09:46:03.522522896 +0900
> >     +++ /home/shin/Blktests/blktests/results/nvme0n1/nvme/051.out.bad   2024-02-06 14:50:57.394105192 +0900
> >     @@ -1,2 +1,3 @@
> >      Running nvme/051
> >     +NVMe status: Invalid Field in Command: A reserved coded value or an unsupported value in a defined field(0x4002)
> >      Test complete
> >
> > I took a look at the latest QEMU code and found that it returns "Invalid Field
> > in Command" when the specified offset is larger than the error log size. Am I
> > missing anything to make this test case pass?
> 
> Ah, good catch. I don't want to fail the test if the device correctly
> reports an error, so I'd need to redirect 2>&1. Success for this test is
> simply that the command completes at all, with any status.

Got it, thanks for the clarification.
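
For my own understanding, I assume the reworked test body will end up looking
something along these lines (just my sketch, not the actual patch; the nvme-cli
flags are from 'nvme get-log', and TEST_DEV is the usual blktests test device
variable):

    # Read the Error Information log (log id 0x01) at a non-zero offset.
    # Throw away stdout/stderr so the test does not depend on the device's
    # reported status; any completion, success or error, counts as a pass.
    # Only a device lock-up or command timeout should make the test fail.
    nvme get-log "$TEST_DEV" --log-id=0x01 --log-len=64 --lpo=64 \
        > /dev/null 2>&1
    echo "Test complete"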

