[PATCH 1/3] nvme-tcp: Do not terminate commands when in RESETTING

Tue Sep 12 05:59:25 PDT 2023

On 9/12/23 14:15, Sagi Grimberg wrote:
> 
>>>> When the error recovery started it will terminate all commands
>>>> anyway, so we should just return BLK_EH_RESET_TIMER for a
>>>> command timeout.
>>>
>>> Does this fix a real issue? the timeout handler and the error
>>> recovery should be mutually exclusive.
>>>
>> They are not. When several timeouts are triggering at the same time,
>> only the first will start error recovery; the remaining ones will
>> still hit the timeout function until cancel_tagset() is called.
> 
> Yes, but they do not access any resources unsynchronized. Meaning
> is this a use-after-free issue?
> 
This patch itself is really just an optimization. Sure the current
version works, but point is that we don't need to cancel the request
manually as they'll be handled by cancel_tagset() anyway.

>>
>>>> Signed-off-by: Hannes Reinecke <hare at suse.de>
>>>> ---
>>>>   drivers/nvme/host/tcp.c | 8 ++++++++
>>>>   1 file changed, 8 insertions(+)
>>>>
>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>> index 5b332d9f87fc..1faef1cf5c94 100644
>>>> --- a/drivers/nvme/host/tcp.c
>>>> +++ b/drivers/nvme/host/tcp.c
>>>> @@ -2274,6 +2274,14 @@ static enum blk_eh_timer_return 
>>>> nvme_tcp_timeout(struct request *rq)
>>>>           nvme_tcp_queue_id(req->queue), nvme_cid(rq), pdu->hdr.type,
>>>>           opc, nvme_opcode_str(qid, opc, fctype));
>>>> +    /*
>>>> +     * If the error recovery is started all commands will be
>>>> +     * aborted anyway, and nothing is to be done here.
>>>> +     */
>>>> +    if (ctrl->state == NVME_CTRL_RESETTING &&
>>>> +        work_pending(&to_tcp_ctrl(ctrl)->err_work))
>>>> +        return BLK_EH_RESET_TIMER;
>>>
>>> What does the work_pending() check give you here?
>>>
>> 'RESETTING' is also entered for firmware download, so we need to check
>> for work_pending() to find out if the error recovery engaged.
> 
> firmware download? what happens if err_work is yet to run when
> this request times out?
> 
But that's fine, as it's queued, and as such will be run eventually.
It'll have a delay until it's completed.

> Not sure what this is protecting against.

At this time, against nothing. It's really a prep patch for moving the
error recovery onto a delayed workqueue.

Cheers,

Hannes