[PATCH] NVMe: Shutdown fixes

Keith Busch keith.busch at intel.com
Wed Nov 25 08:57:03 PST 2015


On Wed, Nov 25, 2015 at 08:42:57AM -0800, Christoph Hellwig wrote:
> >         CPU 0                           CPU 1
> > 
> >   blk_mq_check_expired                nvme_irq
> >     set_bit(REQ_ATOM_COMPLETE)          nvme_process_cq
> >     nvme_timeout                          blk_mq_complete_request
> >                                             if (!test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags) ||
> > 	                                         test_and_clear_bit(REQ_ATOM_QUIESCED, &rq->atomic_flags))
> >       return BLK_EH_QUIESCED
> >     set_bit REQ_ATOM_QUIESCED
> >
> > Both checks on blk_mq_complete_request will fail since the timeout
> > handler hasn't returned "quiesce" yet, and the request will be orphaned.
> 
> BLK_EH_QUIESCED is never returned with your patch applied..

I'm just showing why it was broken before. My patch fixes it since we
reap the command and return BLK_EH_HANDLED.
 
> > Instead we can let nvme_dev_shutdown reap the request and set req->errors
> > directly through one of the two paths described above, then return
> > BLK_EH_HANDLED.
> 
> nvme_dev_shutdown will not complete the command that we timed out,
> blk_mq_complete_request will skip it because REQ_ATOM_COMPLETE is set,
> and blk_mq_rq_timed_out will complete after we returned from the timeout
> handler.

I'm not saying nvme_dev_shutdown "completes" the command. I'm just
saying it reaps it and sets an appropriate req->errors. The actual blk-mq
completion happens as you described through nvme_timeout's return code
of BLK_EH_HANDLED.

> > This is implied through dev->bar != NULL, and the nvmeq's cq_vector value.
> 
> Well, we're not doing anything in the low-level functions at the moment
> (add the checks in nvme_dev_unmap, and all requests with
> REQ_ATOM_COMPLETE to the list), but it still seems rather fragile to
> rely on these low-level functions deep down the stack to get everything
> right.

Agreed.
 
> > To set an appropriate non-zero result, we'd need to add pci_is_enabled()
> > checks in nvme_setup_io_queues to distinguish a non-fatal command error
> > vs timeout. Is that preferable to the check where I have it?
> 
> No, that's not my preference.  My preference is to figure out why you
> get a zero req->errors on timed request.  That really shouldn't happen
> to start with.

req->errors wouldn't be 0, but nvme_set_queue_count returns "0" queues
if req->errors is non-zero. Having 0 queues is not a fatal error, so we
don't want to propagate that error to initialization unless it's a
timeout.

We also don't return error codes from the create sq/cq commands, because
an error there isn't fatal either, unless of course it's a timeout.
