kernel oops after nvme_set_queue_count()
Keith Busch
kbusch at kernel.org
Thu Jan 21 12:07:35 EST 2021
On Thu, Jan 21, 2021 at 09:25:39AM +0100, Hannes Reinecke wrote:
> Hi all,
>
> a customer of ours ran into this oops:
>
> [44157.918962] nvme nvme5: I/O 22 QID 0 timeout
> [44163.347467] nvme nvme5: Could not set queue count (880)
> [44163.347551] nvme nvme5: Successfully reconnected (6 attempts)
> [44168.414977] BUG: unable to handle kernel paging request at ffff888e261e7808
> [44168.414988] IP: 0xffff888e261e7808
> [44168.414994] PGD 98c2ae067 P4D 98c2ae067 PUD f57937063 PMD 8000000f660001e3
>
> It's related to this code snippet in drivers/nvme/host/core.c
>
> 	/*
> 	 * Degraded controllers might return an error when setting the queue
> 	 * count. We still want to be able to bring them online and offer
> 	 * access to the admin queue, as that might be only way to fix them up.
> 	 */
> 	if (status > 0) {
> 		dev_err(ctrl->device, "Could not set queue count (%d)\n", status);
> 		*count = 0;
>
>
> causing nvme_set_queue_count() _not_ to return an error, but rather let the
> reconnect complete.
> Of course, as this failure is due to a timeout (cf the status code; 880
> is NVME_SC_HOST_PATH_ERROR), the admin queue has been torn down by the
> transport, causing this crash.
This doesn't sound right. A controller timeout with no response is
supposed to get the -EINTR return, and that case exits earlier with an
error, above the snippet you're showing.
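
To make the two paths under discussion concrete, here is a rough
user-space sketch of the control flow. It is not the kernel source:
model_set_queue_count() is a hypothetical stand-in, and it takes the
status as a parameter instead of issuing a Set Features command. It only
assumes what this thread describes: a negative errno such as -EINTR is
returned to the caller before the degraded-controller check, while a
positive NVMe status (880 in the report) is logged and turned into
*count = 0 with a return value of 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical model of the queue-count decision, for illustration only.
 * 'status' stands in for the result of the Set Features command:
 *   < 0  negative errno (e.g. -EINTR when the command got no response)
 *   > 0  NVMe status code reported for the command
 *   == 0 success
 */
static int model_set_queue_count(int status, int *count)
{
	assert(count != NULL);

	/* Early exit: the caller sees the error and the setup fails. */
	if (status < 0)
		return status;

	/*
	 * Degraded-controller path: report the status, drop to zero I/O
	 * queues, and still return success so the admin queue stays usable.
	 */
	if (status > 0) {
		fprintf(stderr, "Could not set queue count (%d)\n", status);
		*count = 0;
	}

	/* On success this model leaves *count unchanged. */
	return 0;
}
```

With this model, passing -EINTR returns -EINTR and leaves the count
alone, while passing 880 returns 0 with the count forced to 0, which is
the case that lets the reconnect complete in the report above.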