[PATCH] nvme: do not ignore nvme status in nvme_set_queue_count()

Keith Busch kbusch at kernel.org
Tue Jan 26 14:06:54 EST 2021


On Tue, Jan 26, 2021 at 04:25:11PM +0100, Hannes Reinecke wrote:
> On 1/22/21 5:44 PM, Keith Busch wrote:
> > On Fri, Jan 22, 2021 at 05:35:35PM +0100, Hannes Reinecke wrote:
> > > On 1/21/21 9:14 PM, Keith Busch wrote:
> > > > On Thu, Jan 21, 2021 at 10:50:21AM +0100, Hannes Reinecke wrote:
> > > > > If the call to nvme_set_queue_count() fails with a status, we should
> > > > > not ignore it but rather pass it on to the caller.
> > > > > It's then up to the transport to decide whether to ignore it
> > > > > (like PCI does) or to reset the connection (as would be appropriate
> > > > > for fabrics).
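
For reference, the helper in question looks roughly like this in mainline at
the time (paraphrased from drivers/nvme/host/core.c; the actual patch may
differ in detail). The branch under discussion is the status > 0 case, which
currently swallows the NVMe status and reports success with 0 queues:

int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
{
	u32 q_count = (*count - 1) | ((*count - 1) << 16);
	u32 result;
	int status, nr_io_queues;

	status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count,
			NULL, 0, &result);
	if (status < 0)
		return status;

	/*
	 * Degraded controllers might return an error when setting the
	 * queue count; today that error is logged, *count is forced to 0
	 * and 0 is returned.  The proposal is to return the status here
	 * instead and let the transport decide what to do with it.
	 */
	if (status > 0) {
		dev_err(ctrl->device, "Could not set queue count (%d)\n",
			status);
		*count = 0;
	} else {
		nr_io_queues = min(result & 0xffff, result >> 16) + 1;
		*count = min(*count, nr_io_queues);
	}

	return 0;
}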
> > > > 
> > > > Instead of checking the error, wouldn't checking the number of created
> > > > queues be sufficient? What handling difference do you expect to occur
> > > > between getting a success with 0 queues, vs getting an error?
> > > > 
> > > The difference is that an error will (re-)start recovery, whereas 0 queues
> > > won't. But the problem here is that nvme_set_queue_count() is being called
> > > during reconnection, i.e. during the recovery process itself.
> > > And this command returned with a timeout, which in any other case is
> > > treated as a fatal error. Plus we have been sending this command on
> > > the admin queue, so a timeout on the admin queue pretty much _is_ a fatal
> > > error. So we should be terminating the current recovery and reconnecting.
> > > None of that will happen if we return '0' queues.
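
To make the difference concrete: a fabrics reconnect path that only checks the
resulting queue count would treat a timed-out Set Features as "success with 0
queues" and stop there, whereas a propagated error lets the caller fail the
attempt and schedule a fresh reconnect. A minimal sketch with hypothetical
function names (not actual driver code):

/* Illustrative only: the demo_* name is a stand-in for a fabrics
 * transport's reconnect path.
 */
static int demo_fabrics_configure_io_queues(struct nvme_ctrl *ctrl)
{
	unsigned int nr_io_queues = 8;	/* desired count, for illustration */
	int ret;

	ret = nvme_set_queue_count(ctrl, &nr_io_queues);
	if (ret)
		return ret;	/* error: caller tears down and reconnects */

	if (nr_io_queues == 0) {
		/* with the current behaviour a timed-out command can end
		 * up here, and no recovery is (re-)started
		 */
		dev_warn(ctrl->device, "no I/O queues available\n");
		return 0;
	}

	/* ... allocate and start nr_io_queues I/O queues ... */
	return 0;
}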
> > 
> > You should already be getting an error return status if a timeout occurs
> > for nvme_set_queue_count(), specifically -EINTR. Are you getting success
> > for some reason?
> > 
> -EINTR (which is returned when 'nvme_req(req)->flags & NVME_REQ_CANCELLED'
> is set) will only ever be returned on PCI; fabrics doesn't set this flag,
> so we never get an -EINTR.
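
The translation Hannes refers to sits in the sync-command completion path, and
the flag is only set by the PCIe timeout handler; roughly (paraphrasing the
code of that era):

/* drivers/nvme/host/core.c, __nvme_submit_sync_cmd(), roughly: */
	if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
		ret = -EINTR;
	else
		ret = nvme_req(req)->status;

/* drivers/nvme/host/pci.c, nvme_timeout(), roughly: the flag is set
 * before the timed-out request is completed, which is why PCIe callers
 * see -EINTR on a timeout; the fabrics timeout handlers do not set it.
 */
	nvme_req(req)->flags |= NVME_REQ_CANCELLED;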

Sounds like that's the problem that needs to be fixed.
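
One possible shape of such a fix, purely as an illustration (the hypothetical
demo_fabrics_timeout() below is not an actual driver change), would be for a
fabrics timeout handler to mark the request cancelled before completing it, so
the sync-command path returns -EINTR on fabrics as well:

/* Illustrative sketch only: mark the request cancelled before
 * completing it, so __nvme_submit_sync_cmd() returns -EINTR.
 */
static enum blk_eh_timer_return demo_fabrics_timeout(struct request *rq,
		bool reserved)
{
	nvme_req(rq)->flags |= NVME_REQ_CANCELLED;

	/* ... transport-specific teardown / error recovery kick ... */

	blk_mq_complete_request(rq);
	return BLK_EH_DONE;
}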


