[PATCH] NVMe: Shutdown fixes

Keith Busch keith.busch at intel.com
Wed Nov 25 08:05:45 PST 2015


On Wed, Nov 25, 2015 at 12:55:37AM -0800, Christoph Hellwig wrote:
> > +static inline void nvme_end_req(struct request *req, int error)
> > +{
> > +	/*
> > +	 * The request may already be marked "complete" by blk-mq if
> > +	 * completion occurs during a timeout. We set req->errors
> > +	 * before telling blk-mq in case this is the target request
> > +	 * of a timeout handler. This is safe since this driver never
> > +	 * completes the same command twice.
> > +	 */
> > +	req->errors = error;
> > +	blk_mq_complete_request(req, req->errors);
> 
> I don't think overwriting req->errors on an already completed request
> is a good idea - it's what we used to do earlier, but I specifically changed
> the blk_mq_complete_request so that we don't do that.  Can you explain
> when you want to overwrite the value exactly and why?

Explanation follows:

> >  	if (test_bit(NVME_CTRL_RESETTING, &dev->flags)) {
> > -		req->errors = NVME_SC_ABORT_REQ | NVME_SC_DNR;
> > +		dev_warn(dev->dev,
> > +				"I/O %d QID %d timeout, disable controller\n",
> > +				 req->tag, nvmeq->qid);
> > +		nvme_dev_shutdown(dev);
> >  		return BLK_EH_HANDLED;
> 
> So here you return BLK_EH_HANDLED without setting req->errors, which
> sounds like it might be related to the above issue.

Right, we don't want to set req->errors here. nvme_dev_shutdown
unconditionally reaps every request known to the driver, so it's safe to
assume the timed-out request is completed and return EH_HANDLED.
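
For reference, "reaps every request" means the shutdown path walks every
busy tag and forces a completion, along these lines (a rough sketch using
the driver's existing cancel helper; names and details are from memory,
not a hunk from this patch):

static void nvme_cancel_queue_ios(struct request *req, void *data,
				  bool reserved)
{
	struct nvme_queue *nvmeq = data;

	/* only requests the driver still owns get forced to completion */
	if (!blk_mq_request_started(req))
		return;

	dev_warn(nvmeq->q_dmadev, "Cancelling I/O %d QID %d\n",
		 req->tag, nvmeq->qid);
	nvme_end_req(req, NVME_SC_ABORT_REQ | NVME_SC_DNR);
}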

Executing the nvme_timeout code path more often than not means the
controller isn't going to return that command, but we can't count on that.
So there are two possibilities:

  1. request completes through nvme_process_cq
  2. request completes through nvme_cancel_queue_ios

In either case, req->errors is set through the new nvme_end_req, as you
surmised. We don't want to set req->errors here directly in case the
request completes through process_cq. We should prefer the controller's
status rather than make one up.
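
Roughly, for the process_cq case (paraphrased with a hypothetical helper
name for illustration, not a hunk from this patch):

/* hypothetical helper: approximate shape of the CQ completion path */
static void nvme_complete_cqe(struct request *req,
			      struct nvme_completion *cqe)
{
	u16 status = le16_to_cpu(cqe->status) >> 1;	/* drop the phase bit */

	nvme_end_req(req, status);	/* controller's status, not a made-up one */
}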

> > +		 * Mark the request as handled, since the inline shutdown
> > +		 * forces all outstanding requests to complete.
> >  		 */
> > -		return BLK_EH_QUIESCED;
> > +		return BLK_EH_HANDLED;
> 
> Again we're not setting req->errors here.

Yep, same reasoning as above. This also fixes this race:

        CPU 0                           CPU 1

  blk_mq_check_expired                nvme_irq
    set_bit(REQ_ATOM_COMPLETE)          nvme_process_cq
    nvme_timeout                          blk_mq_complete_request
                                            if (!test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags) ||
	                                         test_and_clear_bit(REQ_ATOM_QUIESCED, &rq->atomic_flags))
      return BLK_EH_QUIESCED
    set_bit REQ_ATOM_QUIESCED

Both checks on blk_mq_complete_request will fail since the timeout
handler hasn't returned "quiesce" yet, and the request will be orphaned.

Instead we can let nvme_dev_shutdown reap the request and set req->errors
directly through one of the two paths described above, then return
BLK_EH_HANDLED.
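
For context, this is roughly what the blk-mq core does with that return
value (paraphrased from block/blk-mq.c of that era, not part of this
patch); the core completes the timed-out request itself, using whatever
req->errors the reap already set:

static void blk_mq_rq_timed_out(struct request *req, bool reserved)
{
	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;

	if (req->q->mq_ops->timeout)
		ret = req->q->mq_ops->timeout(req, reserved);

	switch (ret) {
	case BLK_EH_HANDLED:
		/* the driver says it's done; finish it with req->errors */
		__blk_mq_complete_request(req);
		break;
	case BLK_EH_RESET_TIMER:
		blk_add_timer(req);
		blk_clear_rq_complete(req);
		break;
	case BLK_EH_NOT_HANDLED:
		break;
	}
}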
 
> > @@ -2023,6 +2038,7 @@ static void nvme_dev_shutdown(struct nvme_dev *dev)
> >  
> >  	nvme_dev_list_remove(dev);
> >  
> > +	mutex_lock(&dev->shutdown_lock);
> 
> Shouldn't we also have a state check that skips all the work here if
> the controller has already been shut down?

This is implied by the dev->bar != NULL check and by each nvmeq's cq_vector
value.
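
Concretely, the guards I'm relying on look about like this (sketch from
memory, not hunks from this patch):

	/* nvme_dev_shutdown(): only poke the device if the BAR is still mapped */
	if (dev->bar) {
		nvme_freeze_queues(dev);
		csts = readl(&dev->bar->csts);
	}

	/* nvme_suspend_queue(): bail if the queue was already torn down */
	spin_lock_irq(&nvmeq->q_lock);
	if (nvmeq->cq_vector == -1) {
		spin_unlock_irq(&nvmeq->q_lock);
		return 1;
	}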

> > +	 * The pci device will be disabled if a timeout occurs while setting up
> > +	 * IO queues, but return 0 for "result", so we have to check both.
> > +	 */
> > +	if (result || !pci_is_enabled(to_pci_dev(dev->dev)))
> >  		goto free_tags;
>
> I don't think we should return zero here in that case - that seems to
> be a bug in setting up req->errors in the timeout handler.

To set an appropriate non-zero result, we'd need to add pci_is_enabled()
checks in nvme_setup_io_queues to distinguish a non-fatal command error
from a timeout. Is that preferable to the check where I have it?
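
Something like this, inside the set-features step of nvme_setup_io_queues
(hypothetical sketch only, to show what the extra check would buy us):

	/* hypothetical: after the Set Features (number of queues) command */
	status = nvme_set_features(dev, NVME_FEAT_NUM_QUEUES, q_count, 0,
				   &result);
	if (status < 0)
		return status;
	if (status > 0) {
		/*
		 * A command error is normally non-fatal (fall back to the
		 * queues we already have), but if a timeout got here first
		 * the device is disabled and we must not press on.
		 */
		if (!pci_is_enabled(to_pci_dev(dev->dev)))
			return -ENODEV;
		return 0;
	}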


