[PATCH 0/4] nvme-blkmq fixes
Jens Axboe
axboe at fb.com
Tue Dec 23 09:49:11 PST 2014
On 12/22/2014 06:34 PM, Keith Busch wrote:
> On Mon, 22 Dec 2014, Keith Busch wrote:
>> On Mon, 22 Dec 2014, Jens Axboe wrote:
>>> Should be enough to just check for ->rq_pool being initialized or not
>>> - if it is, we could have waiters and we know the waitqueues have
>>> been setup, etc.
>>>
>>> V2 attached.
>>
>> Yep, that fixes the bug.
>>
>> I'm not sure I follow your suggestion for forcing bt_get() to abandon
>> allocating a request tag when the queue is dying. If hctx_may_queue()
>> fails, it returns a generic error and bt_get() reschedules itself. Should
>> a different error than -1 be returned if the queue is dying?
>
> We're making good incremental improvements, but finding oddities the
> more I test this. This one's a doozy.
>
> Requeued IOs are automatically dispatched, and I don't see an immediately
> available way to stop them. It causes a bug because the queue doorbells are
> unmapped during reset, so you can't touch them when the queue should be
> quiesced. I could fix that by having the driver not kick the requeue_list
> when it knows a reset is in progress, but there's no immediate way
> to drain the list if the reset fails and the device requires removal,
> and blk_cleanup_queue() will be stuck.
>
> Is there something available to call that I'm missing or do I need to
> add more removal handling?
So that's actually a case where having the queues auto-started on a
requeue run is harmful: we should be able to handle this situation by
stopping the queues, requeueing, and then having a helper to eventually
abort the pending requeued work if we have to. But if you simply requeue
them and defer kicking the requeue list, it might work. At that point
you'd either kick the requeues (and hence start processing them) if the
reset went well, or use some blk_mq_abort_requeues() helper that'd kill
them with -EIO instead. Would that work for you?
--
Jens Axboe