kernel oops after nvme_set_queue_count()

Sagi Grimberg sagi at grimberg.me
Thu Jan 21 04:06:22 EST 2021


> Hi all,
> 
> a customer of ours ran into this oops:
> 
> [44157.918962] nvme nvme5: I/O 22 QID 0 timeout
> [44163.347467] nvme nvme5: Could not set queue count (880)
> [44163.347551] nvme nvme5: Successfully reconnected (6 attempts)
> [44168.414977] BUG: unable to handle kernel paging request at ffff888e261e7808
> [44168.414988] IP: 0xffff888e261e7808
> [44168.414994] PGD 98c2ae067 P4D 98c2ae067 PUD f57937063 PMD 8000000f660001e3
> 
> It's related to this code snippet in drivers/nvme/host/core.c
> 
>      /*
>       * Degraded controllers might return an error when setting the queue
>       * count.  We still want to be able to bring them online and offer
>       * access to the admin queue, as that might be the only way to fix
>       * them up.
>       */
>      if (status > 0) {
>          dev_err(ctrl->device, "Could not set queue count (%d)\n", status);
>          *count = 0;
> 
> 
> causing nvme_set_queue_count() _not_ to return an error, but rather
> letting the reconnect complete.
> Of course, as this failure is due to a timeout (cf. the status code; 880
> is NVME_SC_HOST_PATH_ERROR), the admin queue has been torn down by the
> transport, causing this crash.
> 
> So, question: _why_ do we ignore the status?

This used to exist in pci: when a controller reset fails to set up the
I/O queues, at least the controller can still accept admin commands so
we can get some diagnostics out of it (perhaps an error log page).
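For context, here is roughly how the surrounding logic reads (a
paraphrased sketch of nvme_set_queue_count(), not a verbatim copy; the
exact code differs between kernel versions): the NVMe status from Set
Features is swallowed and the function still returns 0, so the caller
continues the reconnect with *count == 0.

    int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
    {
            u32 q_count = (*count - 1) | ((*count - 1) << 16);
            u32 result;
            int status, nr_io_queues;

            status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count,
                            NULL, 0, &result);
            if (status < 0)
                    return status;

            /*
             * status > 0 is an NVMe status code (here NVME_SC_HOST_PATH_ERROR
             * from the timed-out admin command): it is swallowed and only
             * the admin queue is exposed.
             */
            if (status > 0) {
                    dev_err(ctrl->device,
                            "Could not set queue count (%d)\n", status);
                    *count = 0;
            } else {
                    nr_io_queues = min(result & 0xffff, result >> 16) + 1;
                    *count = min(*count, nr_io_queues);
            }

            return 0;
    }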

> For fabrics I completely fail to see the reason here; even _if_ it
> worked, we would end up with a connection for which just the admin queue
> is operable, the state is LIVE, and all information we could glean
> would indicate that the connection is perfectly healthy.

We also had an ADMIN_ONLY state at some point, but that was dropped as
well, for reasons I don't remember at the moment.

> It just doesn't have any I/O queues.
> Which will lead to some very confused customers and some very unhappy 
> support folks trying to figure out what has happened.
> 
> Can we just kill this statement and always return an error?
> In all other cases we are quite trigger-happy with controller reset; why 
> not here?

I think we will want to keep the existing behavior for pci, but agree we
probably want to change it for fabrics...
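Something along these lines, perhaps (just a sketch to illustrate the
idea, not a tested patch; keying off the NVME_F_FABRICS ops flag is only
one possible way to distinguish the two cases):

    if (status > 0) {
            dev_err(ctrl->device,
                    "Could not set queue count (%d)\n", status);
            /*
             * On fabrics a failure here typically means the admin queue /
             * transport association is already gone, so propagate an error
             * and let the reconnect logic tear down and retry instead of
             * going LIVE with zero I/O queues.
             */
            if (ctrl->ops->flags & NVME_F_FABRICS)
                    return -EIO;
            /* pci: keep the degraded controller up for diagnostics */
            *count = 0;
    }

That would keep the degraded-controller escape hatch for pci while
letting fabrics fail the reconnect attempt and retry.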


