[PATCH V2 0/4] nvme: fix two kinds of IO hang from removing NSs

Sagi Grimberg sagi at grimberg.me
Sun Jun 25 01:09:45 PDT 2023


>>>>>> The point was to keep requests from entering while the hctxs are
>>>>>> being reconfigured. If you're going to pair up the freezes as you've
>>>>>> suggested, we might as well just not call freeze at all.
>>>>>
>>>>> blk_mq_update_nr_hw_queues() requires queue to be frozen.
>>>>
>>>> It's too late at that point. Let's work through a real example. You'll
>>>> need a system that has more CPUs than your nvme has IO queues.
>>>>
>>>> Boot without any special nvme parameters. Every possible nvme IO queue
>>>> will be assigned "default" hctx type. Now start IO to every queue, then
>>>> run:
>>>>
>>>>    # echo 8 > /sys/module/nvme/parameters/poll_queues && echo 1 > /sys/class/nvme/nvme0/reset_controller
>>>>
>>>> Today, we freeze prior to tearing down the "default" IO queues, so
>>>> nothing enters them while the driver reconfigures the queues.
>>>
>>> nvme_start_freeze() just prevents new IO from being queued; old ones
>>> may still be entering the block layer queue. What matters here is
>>> actually quiesce, which prevents IO from being queued to the
>>> driver/hardware.
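
For the archive, here is a rough sketch of how the pci driver orders this
across a reset today. It is simplified and not the literal driver code
(function names roughly as in the current nvme core), but it shows the
freeze vs. quiesce split being discussed:

	static void nvme_reset_ordering_sketch(struct nvme_dev *dev)
	{
		/*
		 * Freeze: new submitters block in blk_queue_enter();
		 * requests that already entered the queue are unaffected.
		 */
		nvme_start_freeze(&dev->ctrl);

		/*
		 * Quiesce: nothing is dispatched to ->queue_rq() anymore,
		 * so nothing new reaches the hardware.
		 */
		nvme_quiesce_io_queues(&dev->ctrl);

		/* ... tear down and re-create the IO queues ... */

		/*
		 * Unquiesce so held requests can move again, otherwise the
		 * freeze below can never complete.
		 */
		nvme_unquiesce_io_queues(&dev->ctrl);

		/* The new queue mapping requires fully frozen queues. */
		nvme_wait_freeze(&dev->ctrl);
		blk_mq_update_nr_hw_queues(dev->ctrl.tagset,
					   dev->online_queues - 1);
		nvme_unfreeze(&dev->ctrl);
	}
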
>>>
>>>>
>>>> What you're suggesting will allow IO to queue up in a quiesced "default"
>>>> queue, which will become "polled" without an interrupt handler on the
>>>> other side of the reset. The application doesn't know that, so the IO
>>>> you're allowing to queue up will time out.
>>>
>>> A timeout only happens after the request is queued to the driver/hardware,
>>> i.e. after blk_mq_start_request() is called in nvme_queue_rq(), but quiesce
>>> prevents new IOs from being dispatched to the driver or queued via
>>> .queue_rq(); meanwhile old requests have been canceled, so no request can
>>> time out.
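
To make the blk_mq_start_request() point concrete, this is roughly where
the timeout timer gets armed in the pci driver's .queue_rq (heavily
abbreviated, not the literal code):

	static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
					  const struct blk_mq_queue_data *bd)
	{
		struct request *req = bd->rq;

		/* ... validity checks and command setup ... */

		blk_mq_start_request(req);	/* timeout timer armed here */

		/* ... copy the SQE into the SQ and ring the doorbell ... */

		return BLK_STS_OK;
	}

A request held back by quiesce never reaches blk_mq_start_request(), which
is the point being made above.
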
>>
>> Quiesce doesn't prevent requests from entering an hctx, and you can't
>> back one out and put it on another hctx later. It doesn't matter that you
>> haven't dispatched it to hardware yet. The request's queue was set the
>> moment it was allocated, so after you unquiesce and freeze for the new
>> queue mapping, the requests previously blocked on quiesce will time out
>> in the scenario I've described.
>>
>> There are certainly gaps in the existing code where errored requests
>> can be requeued or stuck elsewhere and hit the exact same problem, but
>> the current way at least tries to contain it.
> 
> Yeah, but you can't remove the gap at all with start_freeze; that is to
> say, the current code has to live with the mapping changing while old
> requests still carry the old mapping.
> 
> Actually I considered handling this kind of situation before; one approach
> is to reuse the bio steal logic used in nvme mpath:
> 
> 1) for FS IO, re-submit the bios and meanwhile free the request
> 
> 2) for a PT request, simply fail it
> 
> It may be a bit harsh for 2), even though REQ_FAILFAST_DRIVER is always
> set for PT requests, but I don't see any better approach for handling
> PT requests.
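
A rough sketch of what I understand by 1)/2) above; the function name is
made up for illustration, and the real thing would likely defer the
resubmission to a requeue work in process context, the way mpath does:

	static void nvme_resubmit_or_fail(struct request *req)
	{
		struct bio_list bios;
		struct bio *bio;

		if (blk_rq_is_passthrough(req)) {
			/* 2) PT request: simply fail it back to the submitter */
			blk_mq_end_request(req, BLK_STS_IOERR);
			return;
		}

		/* 1) FS IO: detach the bios, then complete/free the request */
		bio_list_init(&bios);
		blk_steal_bios(&bios, req);
		blk_mq_end_request(req, BLK_STS_OK);

		/*
		 * Re-submit: each bio allocates a fresh request and picks up
		 * the current (post-reset) queue mapping.
		 */
		while ((bio = bio_list_pop(&bios)))
			submit_bio_noacct(bio);
	}
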

Ming,

I suggest submitting the patches for tcp/rdma and continuing the
discussion on the pci driver.


