kernel oops after nvme_set_queue_count()

Thu Jan 21 03:25:39 EST 2021

Hi all,

a customer of ours ran into this oops:

[44157.918962] nvme nvme5: I/O 22 QID 0 timeout
[44163.347467] nvme nvme5: Could not set queue count (880)
[44163.347551] nvme nvme5: Successfully reconnected (6 attempts)
[44168.414977] BUG: unable to handle kernel paging request at ffff888e261e7808
[44168.414988] IP: 0xffff888e261e7808
[44168.414994] PGD 98c2ae067 P4D 98c2ae067 PUD f57937063 PMD 8000000f660001e3

It's related to this code snippet in drivers/nvme/host/core.c:

	/*
	 * Degraded controllers might return an error when setting the queue
	 * count.  We still want to be able to bring them online and offer
	 * access to the admin queue, as that might be only way to fix them up.
	 */
	if (status > 0) {
		dev_err(ctrl->device, "Could not set queue count (%d)\n", status);
		*count = 0;

which causes nvme_set_queue_count() _not_ to return an error, but
rather to let the reconnect complete.
Of course, as this failure is due to a timeout (cf. the status code;
880 is NVME_SC_HOST_PATH_ERROR), the admin queue has already been torn
down by the transport, and that is what causes this crash.
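
For reference, a minimal sketch of how the logged 880 decodes. The
constant is what I believe include/linux/nvme.h defines; the two helper
names are hypothetical and only there for illustration. The status is
printed in decimal, and 880 is 0x370, i.e. Status Code Type 3 (path
related) with Status Code 0x70:

```c
#include <stdint.h>

/* Value believed to match include/linux/nvme.h; treat it as an assumption. */
#define NVME_SC_HOST_PATH_ERROR	0x370

/* Hypothetical helper: Status Code Type, bits 10:8 of the logged status. */
static inline unsigned int nvme_status_sct(uint16_t status)
{
	return (status >> 8) & 0x7;
}

/* Hypothetical helper: Status Code, bits 7:0 of the logged status. */
static inline unsigned int nvme_status_sc(uint16_t status)
{
	return status & 0xff;
}
```

So the error was generated on the host side for a command that never
got a reply, not returned by a degraded controller.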

So, question: _why_ do we ignore the status?

For fabrics I completely fail to see the reason here; even _if_ it
worked we would end up with a connection for which just the admin queue
is operable, the state is LIVE, and all information we could glean
would indicate that the connection is perfectly healthy.
It just doesn't have any I/O queues.
That will lead to some very confused customers and some very unhappy
support folks trying to figure out what has happened.

Can we just kill this statement and always return an error?
In all other cases we are quite trigger-happy with controller reset; why 
not here?
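
To make the question concrete, here is a rough userspace model of the
two behaviours. It is a sketch, not the kernel function: both function
names are hypothetical, the handling of negative status values is my
assumption about the surrounding code, and -EIO is just a placeholder
for whatever error we would actually want to return.

```c
#include <errno.h>

/*
 * Hypothetical model of today's behaviour: a positive NVMe status is
 * swallowed, the I/O queue count is forced to 0 and the caller sees
 * success, so the reconnect carries on.
 */
static int set_queue_count_current(int status, int *count)
{
	if (status < 0)
		return status;	/* assumed: negative errnos are passed up */
	if (status > 0)
		*count = 0;	/* NVMe status ignored, admin queue only */
	return 0;
}

/*
 * Hypothetical model of the proposal: any failure is reported, so the
 * caller can fail the reconnect attempt instead of going LIVE.
 */
static int set_queue_count_proposed(int status, int *count)
{
	(void)count;		/* left untouched on failure */
	if (status < 0)
		return status;
	if (status > 0)
		return -EIO;	/* placeholder errno, an assumption */
	return 0;
}
```

With status 880 the first variant returns 0 with a queue count of 0,
whereas the second returns an error and leaves the count alone.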

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare at suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer