[PATCH] nvme: do not ignore nvme status in nvme_set_queue_count()

Fri Jan 22 11:44:54 EST 2021

On Fri, Jan 22, 2021 at 05:35:35PM +0100, Hannes Reinecke wrote:
> On 1/21/21 9:14 PM, Keith Busch wrote:
> > On Thu, Jan 21, 2021 at 10:50:21AM +0100, Hannes Reinecke wrote:
> > > If the call to nvme_set_queue_count() fails with a status we should
> > > not ignore it but rather pass it on to the caller.
> > > It's then up to the transport to decide whether to ignore it
> > > (like PCI does) or to reset the connection (as would be appropriate
> > > for fabrics).
> > 
> > Instead of checking the error, wouldn't checking the number of created
> > queues be sufficient? What handling difference do you expect to occur
> > between getting a success with 0 queues, vs getting an error?
> > 
> The difference is that an error will (re-)start recovery, 0 queues won't.
> But the problem here is that nvme_set_queue_count() is being called during
> reconnection, ie during the recovery process itself.
> And this command is returned with a timeout, which in any other case is
> being treated as a fatal error. Plus we have been sending this command on
> the admin queue, so a timeout on the admin queue pretty much _is_  a fatal
> error. So we should be terminating the current recovery and reconnect. None
> of that will happen if we return '0' queues.

You should already be getting an error return status if a timeout occurs
for nvme_set_queue_count(), specifically -EINTR. Are you getting success
for some reason?