[PATCH] NVMe: Shutdown fixes
Keith Busch
keith.busch@intel.com
Wed Nov 25 08:05:45 PST 2015
On Wed, Nov 25, 2015 at 12:55:37AM -0800, Christoph Hellwig wrote:
> > +static inline void nvme_end_req(struct request *req, int error)
> > +{
> > + /*
> > + * The request may already be marked "complete" by blk-mq if
> > + * completion occurs during a timeout. We set req->errors
> > + * before telling blk-mq in case this is the target request
> > + * of a timeout handler. This is safe since this driver never
> > + * completes the same command twice.
> > + */
> > + req->errors = error;
> > + blk_mq_complete_request(req, req->errors);
>
> I don't think overwriting req->errors on an already completed request
> is a good idea - it's what we used to do earlier, but I specifically changed
> the blk_mq_complete_request so that we don't do that. Can you explain
> when you want to overwrite the value exactly and why?
Explanation follows:
> > if (test_bit(NVME_CTRL_RESETTING, &dev->flags)) {
> > - req->errors = NVME_SC_ABORT_REQ | NVME_SC_DNR;
> > + dev_warn(dev->dev,
> > + "I/O %d QID %d timeout, disable controller\n",
> > + req->tag, nvmeq->qid);
> > + nvme_dev_shutdown(dev);
> > return BLK_EH_HANDLED;
>
> So here you return BLK_EH_HANDLED without setting req->errors, which
> sounds like it might be related to the above issue.
Right, we don't want to set req->errors here. nvme_dev_shutdown
unconditionally reaps every request known to the driver, so it's safe
to assume the timed-out request is completed and return BLK_EH_HANDLED.
Entering the nvme_timeout path more often than not means the controller
isn't going to return that command, but we can't count on that.
So there are two possibilities:
1. request completes through nvme_process_cq
2. request completes through nvme_cancel_queue_ios
In either case, req->errors is set through the new nvme_end_req, as you
surmised. We don't want to set req->errors directly here in case the
request completes through nvme_process_cq: the controller's status is
preferable to one we make up.
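To make the two paths concrete, here is a condensed sketch (from memory
of the driver at this point in the series, with call sites trimmed to
the relevant line; don't read it as the literal patch):

	static inline void nvme_end_req(struct request *req, int error)
	{
		/*
		 * Safe even if blk-mq already marked the request
		 * complete during a timeout: this driver never
		 * completes the same command twice.
		 */
		req->errors = error;
		blk_mq_complete_request(req, req->errors);
	}

	/* Path 1, in nvme_process_cq(): controller-supplied status. */
	nvme_end_req(req, le16_to_cpu(cqe.status) >> 1);

	/* Path 2, in nvme_cancel_queue_ios(): fabricated status. */
	nvme_end_req(req, NVME_SC_ABORT_REQ | NVME_SC_DNR);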
> > + * Mark the request as handled, since the inline shutdown
> > + * forces all outstanding requests to complete.
> > */
> > - return BLK_EH_QUIESCED;
> > + return BLK_EH_HANDLED;
>
> Again we're not setting req->errors here.
Yep, same reasoning as above. This also fixes this race:
 CPU 0                              CPU 1

 blk_mq_check_expired               nvme_irq
   set_bit(REQ_ATOM_COMPLETE)         nvme_process_cq
   nvme_timeout                         blk_mq_complete_request
                                          if (!test_and_set_bit(REQ_ATOM_COMPLETE,
                                                                &rq->atomic_flags) ||
                                              test_and_clear_bit(REQ_ATOM_QUIESCED,
                                                                 &rq->atomic_flags))
   return BLK_EH_QUIESCED
 set_bit REQ_ATOM_QUIESCED
Both checks in blk_mq_complete_request fail: CPU 0 has already set
REQ_ATOM_COMPLETE, and REQ_ATOM_QUIESCED isn't set yet because the
timeout handler hasn't returned "quiesce". The completion is dropped
and the request is orphaned. Instead we can let nvme_dev_shutdown reap
the request and set req->errors through one of the two paths described
above, then return BLK_EH_HANDLED.
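Spelling out where the drop happens, blk_mq_complete_request with the
proposed quiesce handling looks roughly like this (abridged sketch, not
the exact code from the series):

	void blk_mq_complete_request(struct request *rq, int error)
	{
		if (!test_and_set_bit(REQ_ATOM_COMPLETE,
				      &rq->atomic_flags) ||
		    test_and_clear_bit(REQ_ATOM_QUIESCED,
				       &rq->atomic_flags)) {
			rq->errors = error;
			__blk_mq_complete_request(rq);
		}
		/* else: completion dropped, request orphaned */
	}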
> > @@ -2023,6 +2038,7 @@ static void nvme_dev_shutdown(struct nvme_dev *dev)
> >
> > nvme_dev_list_remove(dev);
> >
> > + mutex_lock(&dev->shutdown_lock);
>
> Shouldn't we also have a state check that skips all the work here if
> the controller has already been shut down?
This is implied by dev->bar being non-NULL and by each nvmeq's cq_vector value.
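For reference, roughly how the cq_vector guard works (abridged sketch
of nvme_suspend_queue as I remember it, not part of this patch):

	static int nvme_suspend_queue(struct nvme_queue *nvmeq)
	{
		spin_lock_irq(&nvmeq->q_lock);
		if (nvmeq->cq_vector == -1) {
			/* Already torn down by a prior shutdown. */
			spin_unlock_irq(&nvmeq->q_lock);
			return 1;
		}
		nvmeq->cq_vector = -1;
		spin_unlock_irq(&nvmeq->q_lock);
		return 0;
	}

nvme_dev_shutdown likewise only touches the registers while dev->bar is
still mapped, so a second call under shutdown_lock should fall through
without touching the device.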
> > + * The pci device will be disabled if a timeout occurs while setting up
> > + * IO queues, but return 0 for "result", so we have to check both.
> > + */
> > + if (result || !pci_is_enabled(to_pci_dev(dev->dev)))
> > goto free_tags;
>
> I don't think we should return zero here in that case - that seems to
> be a bug in setting up req->errors in the timeout handler.
To set an appropriate non-zero result, we'd need to add pci_is_enabled()
checks in nvme_setup_io_queues to distinguish a non-fatal command error
from a timeout. Is that preferable to the check where I have it?
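For comparison, that alternative would look something like the
following (hypothetical sketch, not in the patch; nvme_setup_result is
a made-up helper name):

	/*
	 * Hypothetical: fold the pci_is_enabled() check into the
	 * setup path so a timeout-induced shutdown surfaces as a
	 * non-zero result to the caller.
	 */
	static int nvme_setup_result(struct nvme_dev *dev, int result)
	{
		if (!result && !pci_is_enabled(to_pci_dev(dev->dev)))
			return -ENODEV;	/* disabled by timeout handler */
		return result;
	}

nvme_setup_io_queues would then return nvme_setup_result(dev, result)
instead of a bare result.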