[PATCH] nvme: mark ctrl as DEAD if removing from error recovery

Mon Jul 10 04:01:18 PDT 2023

On 7/10/23 12:50, Ming Lei wrote:
> On Mon, Jul 10, 2023 at 12:27:31PM +0300, Sagi Grimberg wrote:
>>
>>>>>>> I still want your patches for tcp/rdma that move the freeze.
>>>>>>> If you are not planning to send them, I swear I will :)
>>>>>>
>>>>>> Ming, can you please send the tcp/rdma patches that move the
>>>>>> freeze? As I said before, it addresses an existing issue with
>>>>>> requests unnecessarily blocked on a frozen queue instead of
>>>>>> failing over.
>>>>>
>>>>> Any chance to fix the current issue in one easy(backportable) way[1] first?
>>>>
>>>> There is, you suggested one. And I'm requesting you to send a patch for
>>>> it.
>>>
>>> The patch is the one pointed by link [1], and it still can be applied on current
>>> linus tree.
>>>
>>> https://lore.kernel.org/linux-nvme/20230629064818.2070586-1-ming.lei@redhat.com/
>>
>> This is separate from what I am talking about.
>>
>>>>> All previous discussions on delay freeze[2] are generic, which apply on all
>>>>> nvme drivers, not mention this error handling difference causes extra maintain
>>>>> burden. I still suggest to convert all drivers in same way, and will work
>>>>> along the approach[1] aiming for v6.6.
>>>>
>>>> But we obviously hit a difference in expectations from different
>>>> drivers. In tcp/rdma there is currently an _existing_ bug, where
>>>> we freeze the queue on error recovery, and unfreeze only after we
>>>> reconnect. In the meantime, requests can be blocked on the frozen
>>>> request queue and not failover like they should.
>>>>
>>>> In fabrics the delta between error recovery and reconnect can (and
>>>> often will be) minutes or more. Hence I request that we solve _this_
>>>> issue which is addressed by moving the freeze to the reconnect path.
>>>>
>>>> I personally think that pci should do this as well, and at least
>>>> dual-ported multipath pci devices would prefer instant failover
>>>> than after a full reset cycle. But Keith disagrees and I am not going to
>>>> push for it.
>>>>
>>>> Regardless of anything we do in pci, the tcp/rdma transport
>>>> freeze-blocking-failover _must_ be addressed.
>>>
>>> It is one generic issue, freeze/unfreeze has to be paired strictly
>>> for every driver.
>>>
>>> For any nvme driver, the inbalance can happen when error handling
>>> is involved, that is why I suggest to fix the issue in one generic
>>> way.
>>
>> Ming, you are ignoring what I'm saying. I don't care if the
>> freeze/unfreeze is 100% balanced or not (for the sake of this
>> discussion).
>>
>> I'm talking about a _separate_ issue where a queue
>> is frozen for potentially many minutes blocking requests that
>> could otherwise failover.
>>
>>>> So can you please submit a patch for each? Please phrase it as what
>>>> it is, a bug fix, so stable kernels can pick it up. And try to keep
>>>> it isolated to _only_ the freeze change so that it is easily
>>>> backportable.
>>>
>>> The patch of "[PATCH V2] nvme: mark ctrl as DEAD if removing from error
>>> recovery" can fix them all(include nvme tcp/fc's issue), and can be backported.
>>
>> Ming, this is completely separate from what I'm talking about. This one
>> is addressing when the controller is removed, while I'm talking about
>> the error-recovery and failover, which is ages before the controller is
>> removed.
>>
>>> But as we discussed, we still want to call freeze/unfreeze in pair, and
>>> I also suggest the following approach[2], which isn't good to backport:
>>>
>>> 	1) moving freeze into reset
>>> 	
>>> 	2) during resetting
>>> 	
>>> 	- freeze NS queues
>>> 	- unquiesce NS queues
>>> 	- nvme_wait_freeze()
>>> 	- update_nr_hw_queues
>>> 	- unfreeze NS queues
>>> 	
>>> 	3) meantime changes driver's ->queue_rq() in case that ctrl state is NVME_CTRL_CONNECTING,
>>> 	
>>> 	- if the request is FS IO with data, re-submit all bios of this request, and free the request
>>> 	
>>> 	- otherwise, fail the request
>>>
>>>
>>> [2] https://lore.kernel.org/linux-block/5bddeeb5-39d2-7cec-70ac-e3c623a8fca6@grimberg.me/T/#mfc96266b63eec3e4154f6843be72e5186a4055dc
>>
>> Ming, please read again what my concern is. I'm talking about error recovery
>> freezing a queue, and unfreezing only after we reconnect,
>> blocking requests that should failover.
> 
>  From my understanding, nothing is special for tcp/rdma compared with
> nvme-pci.

But there is... The expectations are different.

> All take two stage error recovery: teardown & [reset(nvme-pci) | reconnect(tcp/rdma)]
> 
> Queues are frozen during teardown, and unfreeze in reset or reconnect.
> 
> If the 2nd stage is failed or bypassed, queues could be left as frozen
> & unquisced, and requests can't be handled, and io hang.
> 
> When tcp reconnect failed, nvme_delete_ctrl() is called for failing
> requests & removing controller.
> 
> Then the patch of "nvme: mark ctrl as DEAD if removing from error recovery"
> can avoid this issue by calling blk_mark_disk_dead() which can fail any
> request pending in bio_queue_enter().
> 
> If that isn't tcp/rdma's issue, can you explain it in details?

Yes I can. The expectation in pci is that a reset lifetime will be a few
seconds. By lifetime I mean it starts and either succeeds or fails.

in fabrics the lifetime is many minutes. i.e. it starts when error
recovery kicks in, and either succeeds (reconnected) or fails (deleted
due to ctrl_loss_tmo). This can take a long period of time (if for
example the controller is down for maintenance/reboot).

Hence, while it may be acceptable that requests are blocked on
a frozen queue for the duration of a reset in pci, it is _not_
acceptable in fabrics. I/O should failover far sooner than that
and must _not_ be dependent on the success/failure or the reset.

Hence I requested that this is addressed specifically for fabrics
(tcp/rdma).

Now, I did mention that I think that pci should behave similar to this,
but that is arguable and is only my opinion, and should be separate from
fixing the issue noted above for tcp/rdma.