[PATCH v4 2/3] nvme: make keep-alive synchronous operation

Nilay Shroff nilay at linux.ibm.com
Tue Oct 22 06:28:47 PDT 2024



On 10/22/24 15:13, Sagi Grimberg wrote:
> 
> 
> 
> On 22/10/2024 9:54, Nilay Shroff wrote:
>>
>> On 10/21/24 20:44, Sagi Grimberg wrote:
>>>
>>>
>>> On 21/10/2024 14:58, Nilay Shroff wrote:
>>>> On 10/21/24 03:19, Sagi Grimberg wrote:
>>>>> Hey Nilay,
>>>>>
>>>>> First, can you please phrase the patch title to describe what it
>>>>> is addressing, not what it is changing, that can come (as it does)
>>>>> in the body.
>>>>>
>>>> Sure, how about naming it "nvme-fabrics: fix kernel crash while shutting down controller"?
>>>>
>>>>> Second, I don't think we want to spend a thread waiting for
>>>>> synchronous IO for the purpose of keep alive. Is this the only
>>>>> way that this can be done?
>>>>>
>>>>> Third, was this observed with non-loop transports?
>>>>>
>>>> Yes it could happen even with non-loop fabric transports.
>>>>
>>>>> On 16/10/2024 6:03, Nilay Shroff wrote:
>>>>>> The nvme keep-alive operation, which executes at a periodic interval,
>>>>>> could potentially sneak in while a fabric controller is being shut down.
>>>>>> This may lead to a race between the fabric controller admin queue
>>>>>> destroy code path (invoked while shutting down the controller) and the
>>>>>> hw/hctx queue dispatcher called from the nvme keep-alive async request
>>>>>> queuing operation. This race can lead to the kernel crash shown below:
>>>>> What is different about keep-alive than any other command?
>>>> The one obvious difference is that keep-alive is async, but we may have
>>>> other async requests running while shutting down the controller. Comparing
>>>> keep-alive with other commands, I found that keep-alive runs in a worker
>>>> thread context, whereas other commands (for instance async read/write) run
>>>> in user context and use blk plugging. When we use plugging, the call path
>>>> for issuing a request to the driver is quite different from that of issuing
>>>> a request using blk_execute_rq_nowait(). This subtle difference is probably
>>>> why issuing read/write commands while shutting down the fabric controller
>>>> doesn't cause the kernel crash.
>>>>
>>>> The call path for blk plugging for issuing the request to driver:
>>>> __blk_flush_plug()
>>>>     -> blk_mq_flush_plug_list()
>>>>       ->blk_mq_dispatch_plug_list()
>>>>
>>>> blk_mq_dispatch_plug_list() first increments ->q_usage_counter before
>>>> dispatching requests, so the queue dispatcher (blk_mq_run_hw_queue()) is
>>>> protected against the queue being destroyed by another CPU/thread. That
>>>> means even after a request (assuming it is the only request in the queue)
>>>> is cancelled or flushed off because the fabric controller is shutting down,
>>>> the blk-mq destroy code path cannot make forward progress until
>>>> ->q_usage_counter is decremented and reaches zero. And we can see in
>>>> blk_mq_dispatch_plug_list() that it decrements ->q_usage_counter only after
>>>> blk_mq_run_hw_queue() returns.
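
To illustrate, here is a rough sketch of the two submission paths described above.
This is a paraphrase, not the actual block/blk-mq.c or nvme core code; the helper
names are from current mainline:

/*
 * Plugged path (blk_mq_dispatch_plug_list): the queue is pinned via
 * ->q_usage_counter around the whole dispatch, so blk_mq_destroy_queue()
 * cannot finish the freeze and free the queue underneath
 * blk_mq_run_hw_queue().
 */
static void plugged_dispatch_sketch(struct blk_mq_hw_ctx *hctx)
{
	struct request_queue *q = hctx->queue;

	percpu_ref_get(&q->q_usage_counter);	/* pin the queue */
	blk_mq_run_hw_queue(hctx, false);	/* dispatch while pinned */
	percpu_ref_put(&q->q_usage_counter);	/* unpin only after dispatch */
}

/*
 * Keep-alive path: the worker submits via blk_execute_rq_nowait() and
 * returns immediately; the only queue reference is the one held by the
 * request itself, which is dropped once the request completes or is
 * flushed off, even though the dispatcher may still be running.
 */
static void keep_alive_submit_sketch(struct request *rq)
{
	blk_execute_rq_nowait(rq, false);
}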
>>> Sounds like an issue that is not related to keep-alive at all.
>>>
>>>>>> Call Trace:
>>>>>>        autoremove_wake_function+0x0/0xbc (unreliable)
>>>>>>        __blk_mq_sched_dispatch_requests+0x114/0x24c
>>>>>>        blk_mq_sched_dispatch_requests+0x44/0x84
>>>>>>        blk_mq_run_hw_queue+0x140/0x220
>>>>>>        nvme_keep_alive_work+0xc8/0x19c [nvme_core]
>>>>>>        process_one_work+0x200/0x4e0
>>>>>>        worker_thread+0x340/0x504
>>>>>>        kthread+0x138/0x140
>>>>>>        start_kernel_thread+0x14/0x18
>>>>> What is the "crash" here? use-after-free?
>>>> Yes, it's a use-after-free kernel crash.
>>>>>> While the fabric controller is shutting down, if an nvme keep-alive request
>>>>>> sneaks in, it gets flushed off. The nvme_keep_alive_end_io function is then
>>>>>> invoked to handle the end of the keep-alive operation; it decrements the
>>>>>> admin ->q_usage_counter, and assuming this was the last/only request in the
>>>>>> admin queue, the counter drops to zero. If that happens, the blk-mq destroy
>>>>>> queue operation (blk_mq_destroy_queue()), which could be running
>>>>>> simultaneously on another cpu (as this is the controller shutdown code
>>>>>> path), makes forward progress and deletes the admin queue. From this point
>>>>>> onward we are not supposed to access the admin queue resources. However,
>>>>>> the nvme keep-alive thread running the hw/hctx queue dispatch operation
>>>>>> hasn't yet finished its work, so it could still access admin queue
>>>>>> resources after the admin queue has already been deleted, and that causes
>>>>>> the above crash.
>>>>> Again, it is unclear to me what justifies the cure here. Every command can
>>>>> race with the shutdown, and it can be sync or async.
>>>>>
>>>> While researching this issue, I figured that the nvme keep-alive is probably
>>>> the only async request running in a worker thread context that is not well
>>>> synchronized with the fabric controller shutdown code path. Yes, we may have
>>>> other async requests (for instance async read/write) which might get in the
>>>> way of the fabric controller shutdown code path, but those mostly run in user
>>>> context and hence have a different submission/queuing code path than the
>>>> keep-alive command.
>>> This sounds like we papered over the problem. Is it not allowed to call blk_execute_rq_nowait()
>>> from a kthread?
>> I didn't mean that calling blk_execute_rq_nowait() from a kthread is not allowed. It's
>> allowed, but the side effect it causes is what I wanted to highlight.
> 
> The point is that any kthread using it would potentially trigger this bug. Hence we should
> treat the disease and not the symptom.
> 
>>>>>> This fix avoids the observed crash by implementing keep-alive as a
>>>>>> synchronous operation, so that the admin ->q_usage_counter is decremented
>>>>>> only after the keep-alive command has finished executing and its status
>>>>>> has been returned to the caller (blk_execute_rq()). This ensures that the
>>>>>> fabric shutdown code path doesn't destroy the fabric admin queue until the
>>>>>> keep-alive request has finished executing, and that the keep-alive thread
>>>>>> is no longer running the hw/hctx queue dispatch operation.
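
For reference, a hedged sketch of what the synchronous variant boils down to; this
is a paraphrase of the idea, not the exact d06923670b5a diff, with rq and ctrl as
used in nvme_keep_alive_work():

	/*
	 * Block until the keep-alive completes, so the request's admin queue
	 * reference is dropped only after the dispatch has fully finished.
	 */
	blk_status_t status;

	status = blk_execute_rq(rq, false);		/* waits for completion */
	nvme_keep_alive_finish(rq, status, ctrl);	/* former end_io handling */
	blk_mq_free_request(rq);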
>>>>> I'd want to see exactly what is causing this race because we already flush
>>>>> the keep alive work when starting the shutdown... so it only means it is
>>>>> incorrectly firing again?
>>>>>
>>>>> What I think we should do instead, given that we already cancel_sync the keep-alive
>>>>> work before making progress with the teardown, is to just make sure that it isn't
>>>>> triggered again afterwards.
>>>> No, keep-alive isn't incorrectly fired; however, once a keep-alive request is
>>>> allocated but before it's queued,
>>> The queue teardown quiesces the admin queue first, which should prevent the dispatch.
>>> You probably mean that it runs a little later, after the dispatch passes the quiesced test...
>>>
>>> The question is, why didn't blk_mq_quiesce_queue() wait for the RCU grace period, which
>>> should have ensured that there is no dispatch going on? At least that is what it's
>>> supposed to do... something doesn't add up.
>>>
>> This issue manifests when the keep-alive command sneaks in after the admin queue is
>> unquiesced but before it's frozen.
>>
>> nvme_unquiesce_admin_queue() <<< we unquiesce admin queue
>>
>> <<< here keep-alive sneaks in and it starts running dispatcher
>>
>> nvme_remove_admin_tag_set()  <<< we freeze queue and wait for ->q_usage_counter to become zero
>>
>> So, as mentioned above, we have a time window between the queue unquiesce and
>> freeze operations where the async keep-alive can potentially sneak in, causing the
>> observed symptom.
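
To make the window concrete, here is a rough sketch of the relevant teardown ordering.
The helper names are from nvme core, but this is only an illustration of the sequence;
the exact flow varies per transport driver:

	nvme_quiesce_admin_queue(ctrl);		/* stop new dispatches */
	/* ... shut down the admin queue, cancel/flush in-flight admin requests ... */
	nvme_unquiesce_admin_queue(ctrl);	/* let flushed requests complete */

	/*
	 * WINDOW: unless it was stopped earlier, nvme_keep_alive_work() can
	 * allocate and dispatch a new keep-alive right here.
	 */

	nvme_remove_admin_tag_set(ctrl);	/* freeze, wait for ->q_usage_counter
						 * to drain, then destroy the queue */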
> 
> The problem here is that keep-alive should not be able to sneak in after we unquiesce
> the admin queue.
> 
> What I'm seeing in the code is that the teardown flow does not call nvme_stop_keep_alive()
> in the teardown path.
> 
> This seems to be caused by [1], so it makes sense that we can even trigger this race.
> If nvme_stop_keep_alive() had been called from nvme_stop_ctrl(), which runs well before
> we quiesce/unquiesce the queue, this issue would not have been encountered.
> 
> I think we could make an argument that both call-sites of nvme_stop_keep_alive() should co-exist,
> with a little documentation as to why.
> Can you try and reproduce with [2]?
> 
> 
> [1]:
> -- 
> commit a54a93d0e3599b05856971734e15418ac551a14c
> Author: Ming Lei <ming.lei at redhat.com>
> Date:   Tue Aug 13 09:35:27 2024 +0800
> 
>     nvme: move stopping keep-alive into nvme_uninit_ctrl()
> 
>     Commit 4733b65d82bd ("nvme: start keep-alive after admin queue setup")
>     moves starting keep-alive from nvme_start_ctrl() into
>     nvme_init_ctrl_finish(), but don't move stopping keep-alive into
>     nvme_uninit_ctrl(), so keep-alive work can be started and keep pending
>     after failing to start controller, finally use-after-free is triggered if
>     nvme host driver is unloaded.
> 
>     This patch fixes kernel panic when running nvme/004 in case that connection
>     failure is triggered, by moving stopping keep-alive into nvme_uninit_ctrl().
> 
>     This way is reasonable because keep-alive is now started in
>     nvme_init_ctrl_finish().
> 
>     Fixes: 3af755a46881 ("nvme: move nvme_stop_keep_alive() back to original position")
>     Cc: Hannes Reinecke <hare at suse.de>
>     Cc: Mark O'Donovan <shiftee at posteo.net>
>     Reported-by: Changhui Zhong <czhong at redhat.com>
>     Reviewed-by: Christoph Hellwig <hch at lst.de>
>     Reviewed-by: Sagi Grimberg <sagi at grimberg.me>
>     Reviewed-by: Hannes Reinecke <hare at suse.de>
>     Signed-off-by: Ming Lei <ming.lei at redhat.com>
>     Signed-off-by: Keith Busch <kbusch at kernel.org>
> -- 
> 
> This patch was introduced as a fix for:
> 3af755a46881 ("nvme: move nvme_stop_keep_alive() back to original position")
> 
> [2]:
> -- 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 0dc8bcc664f2..275af23dda6d 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -4612,6 +4612,11 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
>  {
>         nvme_mpath_stop(ctrl);
>         nvme_auth_stop(ctrl);
> +       /*
> +        * the transport driver may be terminating the admin tagset a little
> +        * later on, so we cannot have the keep-alive work running
> +        */
> +       nvme_stop_keep_alive(ctrl);
>         nvme_stop_failfast_work(ctrl);
>         flush_work(&ctrl->async_event_work);
>         cancel_work_sync(&ctrl->fw_act_work);
> -- 
> 
I tried the above patch [2] and ran nvme/037 in a loop for quite a long time, and
confirmed that with this change I don't see the kernel crash. So I think we should
restore the above change.

Also, please note that the earlier change making keep-alive a synchronous operation
has already been merged into the mainline/Linus' tree. So should we create a new patch
reverting the changes merged in commit d06923670b5a ("nvme: make keep-alive synchronous
operation") and then add the above patch [2] on top of it?

Moreover, I would adjust the new patch so that we can keep commit 599d9f3a10ee
("nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function"). I think the
change implemented in commit 599d9f3a10ee is a logical step forward: it avoids taking
ctrl->lock and instead uses the nvme_ctrl_state() helper to retrieve the controller state.
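
For context, the pattern kept by that commit is roughly the following. This is a
sketch, not the exact diff; nvme_ctrl_state() is the existing helper in
drivers/nvme/host/nvme.h, and delay/ctrl are as in the keep-alive completion path:

	/* Lockless snapshot of the controller state. */
	static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
	{
		return READ_ONCE(ctrl->state);
	}

	/* In nvme_keep_alive_finish(), instead of reading ctrl->state under ctrl->lock: */
	enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);

	if (state == NVME_CTRL_LIVE || state == NVME_CTRL_CONNECTING)
		queue_delayed_work(nvme_wq, &ctrl->ka_work, delay);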

What do you suggest? 

Thanks,
--Nilay


