[PATCH v4 2/3] nvme: make keep-alive synchronous operation

Tue Oct 22 02:43:32 PDT 2024

On 22/10/2024 9:54, Nilay Shroff wrote:
>
> On 10/21/24 20:44, Sagi Grimberg wrote:
>>
>>
>> On 21/10/2024 14:58, Nilay Shroff wrote:
>>> On 10/21/24 03:19, Sagi Grimberg wrote:
>>>> Hey Nilay,
>>>>
>>>> First, can you please phrase the patch title to describe what it
>>>> is addressing, not what it is changing, that can come (as it does)
>>>> in the body.
>>>>
>>> sure, how about naming it as "nvme-fabrics: fix kernel crash while shutting down controller" ?
>>>
>>>> Second, I don't think we want to spend a thread waiting for
>>>> synchronous IO for the purpose of keep alive. Is this the only
>>>> way that this can be done?
>>>>
>>>> Third, was this observed with non-loop transports?
>>>>
>>> Yes it could happen even with non-loop fabric transports.
>>>
>>>> On 16/10/2024 6:03, Nilay Shroff wrote:
>>>>> The nvme keep-alive operation, which executes at a periodic interval,
>>>>> could potentially sneak in while shutting down a fabric controller.
>>>>> This may lead to a race between the fabric controller admin queue
>>>>> destroy code path (invoked while shutting down controller) and hw/hctx
>>>>> queue dispatcher called from the nvme keep-alive async request queuing
>>>>> operation. This race could lead to the kernel crash shown below:
>>>> What is different about keep-alive than any other command?
>>> The one difference is, of course, keep-alive is async but we may have
>>> other async requests running while shutting down the controller. I found
>>> that, when compared the keep-alive with other commands, the keep-alive
>>> runs in the worker thread context but other commands (for instance async
>>> read/write) runs in the user context and that uses the blk plugging. When we
>>> use plugging, the call path for issuing request to driver is quite different
>>> from that of issuing request using blk_execute_rq_nowait(). This subtle
>>> difference is probably the reason not causing the kernel crash issuing
>>> read/write command while shutting down the fabric controller.
>>>
>>> The call path for blk plugging for issuing the request to driver:
>>> __blk_flush_plug()
>>>     -> blk_mq_flush_plug_list()
>>>       ->blk_mq_dispatch_plug_list()
>>>
>>> The blk_mq_dispatch_plug_list() first increments ->q_usage_counter before
>>> dispatching requests. Basically here the queue dispatcher (blk_mq_run_hw_queue
>>> function) is protected from the queue being destroyed on other cpu thread.
>>> So that means that even after a request (assuming the only request in queue)
>>> is cancelled or flushed off due to fabric controller is shutting down, the
>>> blk-mq destroy code path would not be able to forward progress until the
>>> ->q_usage_counter is decremented and becomes zero. And we can see in
>>> blk_mq_dispatch_plug_list() that it decrements ->q_usage_counter only after
>>> the blk_mq_run_hw_queue function returns.
>> Sounds like an issue that is not related to keep-alive at all.
>>
>>>>> Call Trace:
>>>>>        autoremove_wake_function+0x0/0xbc (unreliable)
>>>>>        __blk_mq_sched_dispatch_requests+0x114/0x24c
>>>>>        blk_mq_sched_dispatch_requests+0x44/0x84
>>>>>        blk_mq_run_hw_queue+0x140/0x220
>>>>>        nvme_keep_alive_work+0xc8/0x19c [nvme_core]
>>>>>        process_one_work+0x200/0x4e0
>>>>>        worker_thread+0x340/0x504
>>>>>        kthread+0x138/0x140
>>>>>        start_kernel_thread+0x14/0x18
>>>> What is the "crash" here? use-after-free?
>>> yes it's use-after-free kernel crash.
>>>>> While shutting down fabric controller, if nvme keep-alive request sneaks
>>>>> in then it would be flushed off. The nvme_keep_alive_end_io function is
>>>>> then invoked to handle the end of the keep-alive operation which
>>>>> decrements the admin->q_usage_counter and assuming this is the last/only
>>>>> request in the admin queue then the admin->q_usage_counter becomes zero.
>>>>> If that happens then blk-mq destroy queue operation (blk_mq_destroy_
>>>>> queue()) which could be potentially running simultaneously on another
>>>>> cpu (as this is the controller shutdown code path) would forward
>>>>> progress and deletes the admin queue. So, now from this point onward
>>>>> we are not supposed to access the admin queue resources. However the
>>>>> issue here's that the nvme keep-alive thread running hw/hctx queue
>>>>> dispatch operation hasn't yet finished its work and so it could still
>>>>> potentially access the admin queue resource while the admin queue had
>>>>> been already deleted and that causes the above crash.
>>>> Again, it is unclear to me what justifies the cure here. Every command can
>>>> race with the shutdown, and it can be sync or async.
>>>>
>>> While I was researching on this issue, I figured that the nvme keep-alive is
>>> probably the only async request running in worker thread context and not well
>>> synchronized with the fabric controller shutdown code path. Yes we may have other
>>> async request (for instance async read/write) which might get-in the way of the
>>> fabric controller shutdown code path but those mostly run in the user context and
>>> hence have a different code path than keep-alive command for submitting/queuing
>>> the request.
>> This sounds like we papered around the problem. is it not allowed to call blk_execute_rq_nowait
>> from a kthread?
> I didn't meant calling blk_execute_rq_nowait is not allowed from kthread. It's allowed
> but the side effect it causes is what I just wanted to highlight.

The point is that any kthread using it would potentially trigger this 
bug. Hence we should
treat the disease and not the symptom.

>>>>> This fix helps avoid the observed crash by implementing keep-alive as a
>>>>> synchronous operation so that we decrement admin->q_usage_counter only
>>>>> after keep-alive command finished its execution and returns the command
>>>>> status back up to its caller (blk_execute_rq()). This would ensure that
>>>>> fabric shutdown code path doesn't destroy the fabric admin queue until
>>>>> keep-alive request finished execution and also keep-alive thread is not
>>>>> running hw/hctx queue dispatch operation.
>>>> I'd want to see exactly what is causing this race because we already flush
>>>> the keep alive work when starting the shutdown... so it only means it is
>>>> incorrectly firing again?
>>>>
>>>> What I think we should do instead, is given that we already cancel_sync the keep
>>>> alive work before making progress with the teardown, we should just make sure
>>>> that it doesn't triggered afterwards again if it does.
>>> No, keep-alive doesn't incorrectly fired, however once a keep-alive request is
>>> allocated but before it's queued,
>> The queue teardown quiesce the admin queue first, which should prevent the dispatch.
>> You probably mean that it runs a little later, after the dispatch passes the quiesced test...
>>
>> The question is, why did blk_mq_quiesce_queue did not wait for rcu grace? which should
>> have ensured that there is no dispatch going on? at least that is what it's supposed to do...
>> something doesn't add up
>>
> This issue manifests when the keep-alive command sneaks in after the admin queue is
> unquiesced but before its freezed.
>
> nvme_unquiesce_admin_queue() <<< we unquiesce admin queue
>
> <<< here keep-alive sneaks in and it starts running dispatcher
>
> nvme_remove_admin_tag_set()  <<< we freeze queue and wait for ->q_usage_counter to become zero
>
> So as mentioned above, we have time window between queue unquiesce and freeze
> operations where async keep-alive could potentially sneaks in and causing the
> observed symptom.

The problem here is that keep-alive should not be able to sneak in after 
we unquiescence
the admin queue.

What I'm seeing in the code is that the teardown flow does not call 
nvme_stop_keep_alive()
in the teardown path.

This seems to be caused by [1], so it makes sense that we can even 
trigger this race.
If nvme_stop_keep_alive would have been called from nvme_stop_ctrl() 
which is called
way before we quiesce/unquiesce the queue, then this issue would not 
have been encountered.

I think we could make an argument that both call-sites to 
nvme_stop_keep_alive() should co-exist
with a little documentation to why.
Can you try and reproduce with [2]?


[1]:
--
commit a54a93d0e3599b05856971734e15418ac551a14c
Author: Ming Lei <ming.lei at redhat.com>
Date:   Tue Aug 13 09:35:27 2024 +0800

     nvme: move stopping keep-alive into nvme_uninit_ctrl()

     Commit 4733b65d82bd ("nvme: start keep-alive after admin queue setup")
     moves starting keep-alive from nvme_start_ctrl() into
     nvme_init_ctrl_finish(), but don't move stopping keep-alive into
     nvme_uninit_ctrl(), so keep-alive work can be started and keep pending
     after failing to start controller, finally use-after-free is 
triggered if
     nvme host driver is unloaded.

     This patch fixes kernel panic when running nvme/004 in case that 
connection
     failure is triggered, by moving stopping keep-alive into 
nvme_uninit_ctrl().

     This way is reasonable because keep-alive is now started in
     nvme_init_ctrl_finish().

     Fixes: 3af755a46881 ("nvme: move nvme_stop_keep_alive() back to 
original position")
     Cc: Hannes Reinecke <hare at suse.de>
     Cc: Mark O'Donovan <shiftee at posteo.net>
     Reported-by: Changhui Zhong <czhong at redhat.com>
     Reviewed-by: Christoph Hellwig <hch at lst.de>
     Reviewed-by: Sagi Grimberg <sagi at grimberg.me>
     Reviewed-by: Hannes Reinecke <hare at suse.de>
     Signed-off-by: Ming Lei <ming.lei at redhat.com>
     Signed-off-by: Keith Busch <kbusch at kernel.org>
--

This patch was introduced as a fix for:
3af755a46881 ("nvme: move nvme_stop_keep_alive() back to original position")

[2]:
--

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0dc8bcc664f2..275af23dda6d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4612,6 +4612,11 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
  {
         nvme_mpath_stop(ctrl);
         nvme_auth_stop(ctrl);
+       /*
+        * the transport driver may be terminating the admin tagset a little
+        * later on, so we cannot have the keep-alive work running
+        */
+       nvme_stop_keep_alive(ctrl);
         nvme_stop_failfast_work(ctrl);
         flush_work(&ctrl->async_event_work);
         cancel_work_sync(&ctrl->fw_act_work);
--