[PATCH v4 2/3] nvme: make keep-alive synchronous operation

Sagi Grimberg sagi at grimberg.me
Wed Oct 23 02:40:14 PDT 2024




On 22/10/2024 16:28, Nilay Shroff wrote:
>
> On 10/22/24 15:13, Sagi Grimberg wrote:
>>
>>
>> On 22/10/2024 9:54, Nilay Shroff wrote:
>>> On 10/21/24 20:44, Sagi Grimberg wrote:
>>>>
>>>> On 21/10/2024 14:58, Nilay Shroff wrote:
>>>>> On 10/21/24 03:19, Sagi Grimberg wrote:
>>>>>> Hey Nilay,
>>>>>>
>>>>>> First, can you please phrase the patch title to describe what it
>>>>>> is addressing, not what it is changing, that can come (as it does)
>>>>>> in the body.
>>>>>>
>>>>> Sure, how about naming it "nvme-fabrics: fix kernel crash while shutting down controller"?
>>>>>
>>>>>> Second, I don't think we want to spend a thread waiting for
>>>>>> synchronous IO for the purpose of keep alive. Is this the only
>>>>>> way that this can be done?
>>>>>>
>>>>>> Third, was this observed with non-loop transports?
>>>>>>
>>>>> Yes it could happen even with non-loop fabric transports.
>>>>>
>>>>>> On 16/10/2024 6:03, Nilay Shroff wrote:
>>>>>>> The nvme keep-alive operation, which executes at a periodic interval,
>>>>>>> could potentially sneak in while shutting down a fabric controller.
>>>>>>> This may lead to a race between the fabric controller admin queue
>>>>>>> destroy code path (invoked while shutting down the controller) and the
>>>>>>> hw/hctx queue dispatcher called from the nvme keep-alive async request
>>>>>>> queuing operation. This race could lead to the kernel crash shown below:
>>>>>> What is different about keep-alive than any other command?
>>>>> The one difference is, of course, that keep-alive is async, but we may have
>>>>> other async requests running while shutting down the controller. When
>>>>> comparing keep-alive with other commands, I found that keep-alive runs in
>>>>> worker thread context while other commands (for instance async read/write)
>>>>> run in user context and use blk plugging. When we use plugging, the call
>>>>> path for issuing a request to the driver is quite different from that of
>>>>> issuing a request with blk_execute_rq_nowait(). This subtle difference is
>>>>> probably why issuing read/write commands while shutting down the fabric
>>>>> controller does not cause the kernel crash.
>>>>>
>>>>> The call path for issuing a request to the driver via blk plugging:
>>>>> __blk_flush_plug()
>>>>>      -> blk_mq_flush_plug_list()
>>>>>        ->blk_mq_dispatch_plug_list()
>>>>>
>>>>> blk_mq_dispatch_plug_list() first increments ->q_usage_counter before
>>>>> dispatching requests, so the queue dispatcher (blk_mq_run_hw_queue()) is
>>>>> protected from the queue being destroyed on another CPU. That means that
>>>>> even after a request (assuming it is the only request in the queue) is
>>>>> cancelled or flushed off because the fabric controller is shutting down,
>>>>> the blk-mq destroy code path cannot make forward progress until
>>>>> ->q_usage_counter is decremented and becomes zero. And we can see that
>>>>> blk_mq_dispatch_plug_list() decrements ->q_usage_counter only after
>>>>> blk_mq_run_hw_queue() returns.
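For reference, the guard being described looks roughly like this (a
simplified sketch of the plugged dispatch path, not the actual block
layer source):

    /*
     * Sketch: the plugged dispatch path pins the queue across the hctx
     * run, so blk_mq_destroy_queue() on another CPU cannot complete
     * (and free the queue) until the dispatch has returned.
     */
    percpu_ref_get(&q->q_usage_counter);   /* enter the queue            */
    blk_mq_run_hw_queue(hctx, false);      /* dispatch plugged requests  */
    percpu_ref_put(&q->q_usage_counter);   /* exit only after dispatch   */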
>>>> Sounds like an issue that is not related to keep-alive at all.
>>>>
>>>>>>> Call Trace:
>>>>>>>         autoremove_wake_function+0x0/0xbc (unreliable)
>>>>>>>         __blk_mq_sched_dispatch_requests+0x114/0x24c
>>>>>>>         blk_mq_sched_dispatch_requests+0x44/0x84
>>>>>>>         blk_mq_run_hw_queue+0x140/0x220
>>>>>>>         nvme_keep_alive_work+0xc8/0x19c [nvme_core]
>>>>>>>         process_one_work+0x200/0x4e0
>>>>>>>         worker_thread+0x340/0x504
>>>>>>>         kthread+0x138/0x140
>>>>>>>         start_kernel_thread+0x14/0x18
>>>>>> What is the "crash" here? use-after-free?
>>>>> Yes, it's a use-after-free kernel crash.
>>>>>>> While shutting down the fabric controller, if an nvme keep-alive request
>>>>>>> sneaks in then it would be flushed off. The nvme_keep_alive_end_io
>>>>>>> function is then invoked to handle the end of the keep-alive operation;
>>>>>>> it decrements the admin->q_usage_counter and, assuming this was the
>>>>>>> last/only request in the admin queue, admin->q_usage_counter becomes
>>>>>>> zero. If that happens, the blk-mq destroy queue operation
>>>>>>> (blk_mq_destroy_queue()), which could potentially be running
>>>>>>> simultaneously on another cpu (as this is the controller shutdown code
>>>>>>> path), would make forward progress and delete the admin queue. From this
>>>>>>> point onward we are not supposed to access the admin queue resources.
>>>>>>> However, the issue here is that the nvme keep-alive thread running the
>>>>>>> hw/hctx queue dispatch operation hasn't yet finished its work, so it
>>>>>>> could still access the admin queue resources after the admin queue has
>>>>>>> already been deleted, and that causes the above crash.
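The async submission path being described is roughly the following (a
sketch of the keep-alive worker's dispatch, not the exact driver
source):

    /*
     * Sketch: blk_execute_rq_nowait() runs the hw queue from this
     * worker context but does not wait for completion. If the request
     * is flushed off, nvme_keep_alive_end_io() can drop the last admin
     * queue reference while the dispatch code is still running, letting
     * the shutdown path free the admin queue underneath it.
     */
    rq->timeout = ctrl->kato * HZ;
    rq->end_io = nvme_keep_alive_end_io;
    rq->end_io_data = ctrl;
    blk_execute_rq_nowait(rq, false);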
>>>>>> Again, it is unclear to me what justifies the cure here. Every command can
>>>>>> race with the shutdown, and it can be sync or async.
>>>>>>
>>>>> While I was researching this issue, I figured that nvme keep-alive is
>>>>> probably the only async request running in worker thread context that is
>>>>> not well synchronized with the fabric controller shutdown code path. Yes,
>>>>> we may have other async requests (for instance async read/write) which
>>>>> might get in the way of the fabric controller shutdown code path, but those
>>>>> mostly run in user context and hence have a different code path than the
>>>>> keep-alive command for submitting/queuing the request.
>>>> This sounds like we papered around the problem. Is it not allowed to call
>>>> blk_execute_rq_nowait() from a kthread?
>>> I didn't mean that calling blk_execute_rq_nowait() from a kthread is not
>>> allowed. It's allowed, but the side effect it causes is what I wanted to
>>> highlight.
>> The point is that any kthread using it would potentially trigger this bug. Hence we should
>> treat the disease and not the symptom.
>>
>>>>>>> This fix helps avoid the observed crash by implementing keep-alive as a
>>>>>>> synchronous operation, so that we decrement admin->q_usage_counter only
>>>>>>> after the keep-alive command has finished its execution and returned the
>>>>>>> command status back to its caller (blk_execute_rq()). This ensures that
>>>>>>> the fabric shutdown code path doesn't destroy the fabric admin queue
>>>>>>> until the keep-alive request has finished execution and the keep-alive
>>>>>>> thread is no longer running the hw/hctx queue dispatch operation.
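The synchronous variant being referred to has roughly this shape (a
sketch only; the nvme_keep_alive_finish() call shape is illustrative,
not the exact patch):

    /*
     * Sketch: blk_execute_rq() blocks until the command completes (or
     * is flushed off), so the worker drops its admin queue reference
     * only after the dispatch path has fully returned.
     */
    rq->timeout = ctrl->kato * HZ;
    status = blk_execute_rq(rq, false);
    nvme_keep_alive_finish(rq, status, ctrl);  /* completion handling */
    blk_mq_free_request(rq);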
>>>>>> I'd want to see exactly what is causing this race, because we already flush
>>>>>> the keep-alive work when starting the shutdown... so does that mean it is
>>>>>> incorrectly firing again?
>>>>>>
>>>>>> What I think we should do instead, given that we already cancel_sync the keep-alive
>>>>>> work before making progress with the teardown, is just make sure that it doesn't
>>>>>> get triggered again afterwards.
>>>>> No, keep-alive doesn't incorrectly fire. However, once a keep-alive request is
>>>>> allocated but before it's queued,
>>>> The queue teardown quiesces the admin queue first, which should prevent the dispatch.
>>>> You probably mean that it runs a little later, after the dispatch passes the quiesced test...
>>>>
>>>> The question is, why did blk_mq_quiesce_queue() not wait for the RCU grace period, which
>>>> should have ensured that there is no dispatch going on? At least that is what it's supposed
>>>> to do... something doesn't add up.
>>>>
>>> This issue manifests when the keep-alive command sneaks in after the admin queue is
>>> unquiesced but before it's frozen.
>>>
>>> nvme_unquiesce_admin_queue() <<< we unquiesce admin queue
>>>
>>> <<< here keep-alive sneaks in and it starts running dispatcher
>>>
>>> nvme_remove_admin_tag_set()  <<< we freeze the queue and wait for ->q_usage_counter to become zero
>>>
>>> So, as mentioned above, we have a time window between the queue unquiesce and freeze
>>> operations where the async keep-alive could potentially sneak in, causing the
>>> observed symptom.
>> The problem here is that keep-alive should not be able to sneak in after we unquiesce
>> the admin queue.
>>
>> What I'm seeing in the code is that the teardown path does not call
>> nvme_stop_keep_alive().
>>
>> This seems to be caused by [1], so it makes sense that we can even trigger this race.
>> If nvme_stop_keep_alive() had been called from nvme_stop_ctrl(), which is called
>> well before we quiesce/unquiesce the queue, then this issue would not have been encountered.
>>
>> I think we could make an argument that both call sites of nvme_stop_keep_alive() should co-exist,
>> with a little documentation as to why.
>> Can you try and reproduce with [2]?
>>
>>
>> [1]:
>> -- 
>> commit a54a93d0e3599b05856971734e15418ac551a14c
>> Author: Ming Lei <ming.lei at redhat.com>
>> Date:   Tue Aug 13 09:35:27 2024 +0800
>>
>>      nvme: move stopping keep-alive into nvme_uninit_ctrl()
>>
>>      Commit 4733b65d82bd ("nvme: start keep-alive after admin queue setup")
>>      moves starting keep-alive from nvme_start_ctrl() into
>>      nvme_init_ctrl_finish(), but don't move stopping keep-alive into
>>      nvme_uninit_ctrl(), so keep-alive work can be started and keep pending
>>      after failing to start controller, finally use-after-free is triggered if
>>      nvme host driver is unloaded.
>>
>>      This patch fixes kernel panic when running nvme/004 in case that connection
>>      failure is triggered, by moving stopping keep-alive into nvme_uninit_ctrl().
>>
>>      This way is reasonable because keep-alive is now started in
>>      nvme_init_ctrl_finish().
>>
>>      Fixes: 3af755a46881 ("nvme: move nvme_stop_keep_alive() back to original position")
>>      Cc: Hannes Reinecke <hare at suse.de>
>>      Cc: Mark O'Donovan <shiftee at posteo.net>
>>      Reported-by: Changhui Zhong <czhong at redhat.com>
>>      Reviewed-by: Christoph Hellwig <hch at lst.de>
>>      Reviewed-by: Sagi Grimberg <sagi at grimberg.me>
>>      Reviewed-by: Hannes Reinecke <hare at suse.de>
>>      Signed-off-by: Ming Lei <ming.lei at redhat.com>
>>      Signed-off-by: Keith Busch <kbusch at kernel.org>
>> -- 
>>
>> This patch was introduced as a fix for:
>> 3af755a46881 ("nvme: move nvme_stop_keep_alive() back to original position")
>>
>> [2]:
>> -- 
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 0dc8bcc664f2..275af23dda6d 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -4612,6 +4612,11 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
>>   {
>>          nvme_mpath_stop(ctrl);
>>          nvme_auth_stop(ctrl);
>> +       /*
>> +        * the transport driver may be terminating the admin tagset a little
>> +        * later on, so we cannot have the keep-alive work running
>> +        */
>> +       nvme_stop_keep_alive(ctrl);
>>          nvme_stop_failfast_work(ctrl);
>>          flush_work(&ctrl->async_event_work);
>>          cancel_work_sync(&ctrl->fw_act_work);
>> -- 
>>
> I tried the above patch [2] and ran nvme/037 in a loop for quite a long time and
> confirmed that with this change I don't see the kernel crash. So I think we should
> restore the above change.

Yes, and we need a "Fixes:" tag because it solves a regression caused by [1]

>
> Also please note that the earlier change, where we made keep-alive a synchronous
> operation, has already been merged into the mainline/Linus' kernel tree. So should we
> create a new patch reverting the changes merged in commit d06923670b5a ("nvme:
> make keep-alive synchronous operation") and then add the above patch [2] on
> top of it?

Yes, one revert patch and one patch for [2]

>
> Moreover, I would adjust the new patch so that we can keep change 599d9f3a10ee
> ("nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function"). I think
> the change implemented in commit 599d9f3a10ee is a logical step forward: it avoids using
> ctrl->lock and instead uses the helper nvme_ctrl_state for retrieving the controller state.

That is fine to keep.


