[PATCH v4 2/3] nvme: make keep-alive synchronous operation

Mon Oct 21 23:54:57 PDT 2024

On 10/21/24 20:44, Sagi Grimberg wrote:
> 
> 
> 
> On 21/10/2024 14:58, Nilay Shroff wrote:
>>
>> On 10/21/24 03:19, Sagi Grimberg wrote:
>>> Hey Nilay,
>>>
>>> First, can you please phrase the patch title to describe what it
>>> is addressing, not what it is changing, that can come (as it does)
>>> in the body.
>>>
>> sure, how about naming it as "nvme-fabrics: fix kernel crash while shutting down controller" ?
>>
>>> Second, I don't think we want to spend a thread waiting for
>>> synchronous IO for the purpose of keep alive. Is this the only
>>> way that this can be done?
>>>
>>> Third, was this observed with non-loop transports?
>>>
>> Yes it could happen even with non-loop fabric transports.
>>
>>> On 16/10/2024 6:03, Nilay Shroff wrote:
>>>> The nvme keep-alive operation, which executes at a periodic interval,
>>>> could potentially sneak in while shutting down a fabric controller.
>>>> This may lead to a race between the fabric controller admin queue
>>>> destroy code path (invoked while shutting down controller) and hw/hctx
>>>> queue dispatcher called from the nvme keep-alive async request queuing
>>>> operation. This race could lead to the kernel crash shown below:
>>> What is different about keep-alive than any other command?
>> The one difference is, of course, keep-alive is async but we may have
>> other async requests running while shutting down the controller. I found
>> that, when compared the keep-alive with other commands, the keep-alive
>> runs in the worker thread context but other commands (for instance async
>> read/write) runs in the user context and that uses the blk plugging. When we
>> use plugging, the call path for issuing request to driver is quite different
>> from that of issuing request using blk_execute_rq_nowait(). This subtle
>> difference is probably the reason not causing the kernel crash issuing
>> read/write command while shutting down the fabric controller.
>>
>> The call path for blk plugging for issuing the request to driver:
>> __blk_flush_plug()
>>    -> blk_mq_flush_plug_list()
>>      ->blk_mq_dispatch_plug_list()
>>
>> The blk_mq_dispatch_plug_list() first increments ->q_usage_counter before
>> dispatching requests. Basically here the queue dispatcher (blk_mq_run_hw_queue
>> function) is protected from the queue being destroyed on other cpu thread.
>> So that means that even after a request (assuming the only request in queue)
>> is cancelled or flushed off due to fabric controller is shutting down, the
>> blk-mq destroy code path would not be able to forward progress until the
>> ->q_usage_counter is decremented and becomes zero. And we can see in
>> blk_mq_dispatch_plug_list() that it decrements ->q_usage_counter only after
>> the blk_mq_run_hw_queue function returns.
> 
> Sounds like an issue that is not related to keep-alive at all.
> 
>>>> Call Trace:
>>>>       autoremove_wake_function+0x0/0xbc (unreliable)
>>>>       __blk_mq_sched_dispatch_requests+0x114/0x24c
>>>>       blk_mq_sched_dispatch_requests+0x44/0x84
>>>>       blk_mq_run_hw_queue+0x140/0x220
>>>>       nvme_keep_alive_work+0xc8/0x19c [nvme_core]
>>>>       process_one_work+0x200/0x4e0
>>>>       worker_thread+0x340/0x504
>>>>       kthread+0x138/0x140
>>>>       start_kernel_thread+0x14/0x18
>>> What is the "crash" here? use-after-free?
>> yes it's use-after-free kernel crash.
>>>> While shutting down fabric controller, if nvme keep-alive request sneaks
>>>> in then it would be flushed off. The nvme_keep_alive_end_io function is
>>>> then invoked to handle the end of the keep-alive operation which
>>>> decrements the admin->q_usage_counter and assuming this is the last/only
>>>> request in the admin queue then the admin->q_usage_counter becomes zero.
>>>> If that happens then blk-mq destroy queue operation (blk_mq_destroy_
>>>> queue()) which could be potentially running simultaneously on another
>>>> cpu (as this is the controller shutdown code path) would forward
>>>> progress and deletes the admin queue. So, now from this point onward
>>>> we are not supposed to access the admin queue resources. However the
>>>> issue here's that the nvme keep-alive thread running hw/hctx queue
>>>> dispatch operation hasn't yet finished its work and so it could still
>>>> potentially access the admin queue resource while the admin queue had
>>>> been already deleted and that causes the above crash.
>>> Again, it is unclear to me what justifies the cure here. Every command can
>>> race with the shutdown, and it can be sync or async.
>>>
>> While I was researching on this issue, I figured that the nvme keep-alive is
>> probably the only async request running in worker thread context and not well
>> synchronized with the fabric controller shutdown code path. Yes we may have other
>> async request (for instance async read/write) which might get-in the way of the
>> fabric controller shutdown code path but those mostly run in the user context and
>> hence have a different code path than keep-alive command for submitting/queuing
>> the request.
> 
> This sounds like we papered around the problem. is it not allowed to call blk_execute_rq_nowait
> from a kthread?
I didn't meant calling blk_execute_rq_nowait is not allowed from kthread. It's allowed 
but the side effect it causes is what I just wanted to highlight. 
> 
>>
>>>> This fix helps avoid the observed crash by implementing keep-alive as a
>>>> synchronous operation so that we decrement admin->q_usage_counter only
>>>> after keep-alive command finished its execution and returns the command
>>>> status back up to its caller (blk_execute_rq()). This would ensure that
>>>> fabric shutdown code path doesn't destroy the fabric admin queue until
>>>> keep-alive request finished execution and also keep-alive thread is not
>>>> running hw/hctx queue dispatch operation.
>>> I'd want to see exactly what is causing this race because we already flush
>>> the keep alive work when starting the shutdown... so it only means it is
>>> incorrectly firing again?
>>>
>>> What I think we should do instead, is given that we already cancel_sync the keep
>>> alive work before making progress with the teardown, we should just make sure
>>> that it doesn't triggered afterwards again if it does.
>> No, keep-alive doesn't incorrectly fired, however once a keep-alive request is
>> allocated but before it's queued,
> 
> The queue teardown quiesce the admin queue first, which should prevent the dispatch.
> You probably mean that it runs a little later, after the dispatch passes the quiesced test...
> 
> The question is, why did blk_mq_quiesce_queue did not wait for rcu grace? which should
> have ensured that there is no dispatch going on? at least that is what it's supposed to do...
> something doesn't add up
> 
This issue manifests when the keep-alive command sneaks in after the admin queue is 
unquiesced but before its freezed.

nvme_unquiesce_admin_queue() <<< we unquiesce admin queue   

<<< here keep-alive sneaks in and it starts running dispatcher

nvme_remove_admin_tag_set()  <<< we freeze queue and wait for ->q_usage_counter to become zero 

So as mentioned above, we have time window between queue unquiesce and freeze 
operations where async keep-alive could potentially sneaks in and causing the 
observed symptom.

Thanks,
--Nilay