kernel oops after nvme_set_queue_count()

Thu Jan 21 12:07:35 EST 2021

On Thu, Jan 21, 2021 at 09:25:39AM +0100, Hannes Reinecke wrote:
> Hi all,
> 
> a customer of ours ran into this oops:
> 
> [44157.918962] nvme nvme5: I/O 22 QID 0 timeout
> [44163.347467] nvme nvme5: Could not set queue count (880)
> [44163.347551] nvme nvme5: Successfully reconnected (6 attempts)
> [44168.414977] BUG: unable to handle kernel paging request at ffff888e261e7808
> [44168.414988] IP: 0xffff888e261e7808
> [44168.414994] PGD 98c2ae067 P4D 98c2ae067 PUD f57937063 PMD 8000000f660001e3
> 
> It's related to this code snippet in drivers/nvme/host/core.c
> 
> 	/*
> 	 * Degraded controllers might return an error when setting the queue
> 	 * count.  We still want to be able to bring them online and offer
> 	 * access to the admin queue, as that might be only way to fix them up.
> 	 */
> 	if (status > 0) {
> 		dev_err(ctrl->device, "Could not set queue count (%d)\n", status);
> 		*count = 0;
> 
> 
> causing nvme_set_queue_count() _not_ to return an error, but rather let the
> reconnect complete.
> Of course, as this failure is due to a timeout (cf the status code; 880
> is NVME_SC_HOST_PATH_ERROR), the admin queue has been torn down by the
> transport, causing this crash.

This doesn't sound right. A command that times out with no response from
the controller is supposed to get an -EINTR return, and the function
exits on that earlier, returning the error, above the code you're showing.
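
For illustration, the two paths being discussed can be modelled in user
space roughly as follows. This is only a sketch under stated assumptions:
model_set_queue_count() is a hypothetical stand-in for the tail of
nvme_set_queue_count(), "status" stands in for the result of the Set
Features command, and the success path is simplified (the requested
count is left unchanged).

```c
#include <errno.h>
#include <stdio.h>

/* NVMe status 0x370 is decimal 880, the value printed in the log above. */
#define NVME_SC_HOST_PATH_ERROR 0x370

/*
 * Hypothetical model, not the driver code. "status" is:
 *   < 0   negative errno from the host side (e.g. -EINTR)
 *   > 0   NVMe status code
 *   == 0  success
 */
static int model_set_queue_count(int status, int *count)
{
	if (status < 0)
		return status;	/* early exit: the caller sees the error */

	if (status > 0) {
		fprintf(stderr, "Could not set queue count (%d)\n", status);
		*count = 0;	/* no I/O queues, admin queue only */
	}

	return 0;		/* the caller treats this as success */
}
```

With status == -EINTR the caller gets the error back and count is left
alone; with status == 880 the function returns 0 and count becomes 0,
which is the path the reported log went through.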