[PATCH] nvme: mark ctrl as DEAD if removing from error recovery

Mon Jul 10 01:23:44 PDT 2023

>>>> The namespace's request queue is frozen and quiesced during error
>>>> recovery, so writeback IO is blocked in bio_queue_enter(); as a result,
>>>> fsync_bdev() <- del_gendisk() can't move on, causing an IO hang.
>>>> Removal could come from sysfs, hard unplug or error handling.
>>>>
>>>> Fix this kind of issue by marking the controller as DEAD if removal
>>>> breaks error recovery.
>>>>
>>>> This way is reasonable too, because the controller can't be recovered
>>>> any more after being removed.
>>>
>>> This looks fine to me Ming,
>>> Reviewed-by: Sagi Grimberg <sagi at grimberg.me>
>>>
>>>
>>> I still want your patches for tcp/rdma that move the freeze.
>>> If you are not planning to send them, I swear I will :)
>>
>> Ming, can you please send the tcp/rdma patches that move the
>> freeze? As I said before, it addresses an existing issue with
>> requests unnecessarily blocked on a frozen queue instead of
>> failing over.
> 
> Is there any chance to fix the current issue in one easy (backportable) way[1] first?

There is; you suggested one, and I'm asking you to send a patch for
it.

> 
> All previous discussions on delaying the freeze[2] are generic and apply to
> all nvme drivers, not to mention that this difference in error handling
> causes an extra maintenance burden. I still suggest converting all drivers
> in the same way, and will work along approach[1], aiming for v6.6.

But we obviously hit a difference in expectations between different
drivers. In tcp/rdma there is currently an _existing_ bug, where
we freeze the queue on error recovery and unfreeze it only after we
reconnect. In the meantime, requests can be blocked on the frozen
request queue instead of failing over as they should.

In fabrics, the delta between error recovery and reconnect can be (and
often will be) minutes or more. Hence I request that we solve _this_
issue, which is addressed by moving the freeze to the reconnect path.

I personally think that pci should do this as well, and at least
dual-ported multipath pci devices would prefer instant failover
over waiting for a full reset cycle. But Keith disagrees and I am not
going to push for it.

Regardless of anything we do in pci, the tcp/rdma transport 
freeze-blocking-failover _must_ be addressed.

So can you please submit a patch for each? Please phrase it as what
it is, a bug fix, so stable kernels can pick it up. And try to keep
it isolated to _only_ the freeze change so that it is easily
backportable.

Thanks.