[PATCH v4 2/3] nvme: make keep-alive synchronous operation
Sagi Grimberg
sagi at grimberg.me
Mon Oct 21 08:14:56 PDT 2024
On 21/10/2024 14:58, Nilay Shroff wrote:
>
> On 10/21/24 03:19, Sagi Grimberg wrote:
>> Hey Nilay,
>>
>> First, can you please phrase the patch title to describe what it
>> is addressing, not what it is changing? That can come (as it does)
>> in the body.
>>
> Sure, how about naming it "nvme-fabrics: fix kernel crash while shutting down controller"?
>
>> Second, I don't think we want to spend a thread waiting for
>> synchronous IO for the purpose of keep alive. Is this the only
>> way that this can be done?
>>
>> Third, was this observed with non-loop transports?
>>
> Yes it could happen even with non-loop fabric transports.
>
>> On 16/10/2024 6:03, Nilay Shroff wrote:
>>> The nvme keep-alive operation, which executes at a periodic interval,
>>> could potentially sneak in while shutting down a fabric controller.
>>> This may lead to a race between the fabric controller admin queue
>>> destroy code path (invoked while shutting down the controller) and the
>>> hw/hctx queue dispatcher called from the nvme keep-alive async request
>>> queuing operation. This race could lead to the kernel crash shown below:
>> What is different about keep-alive than any other command?
> The main difference is, of course, that keep-alive is async, though we may
> have other async requests running while shutting down the controller.
> Comparing keep-alive with other commands, I found that keep-alive runs in
> worker thread context while other commands (for instance async read/write)
> run in user context and use blk plugging. With plugging, the call path for
> issuing a request to the driver is quite different from the one used by
> blk_execute_rq_nowait(). This subtle difference is probably why we don't see
> the kernel crash when issuing read/write commands while shutting down the
> fabric controller.
>
> The call path for issuing the request to the driver via blk plugging:
> __blk_flush_plug()
>  -> blk_mq_flush_plug_list()
>   -> blk_mq_dispatch_plug_list()
>
> blk_mq_dispatch_plug_list() first increments ->q_usage_counter before
> dispatching requests, so the queue dispatcher (blk_mq_run_hw_queue()) is
> protected against the queue being destroyed by another cpu thread. That means
> even after a request (assuming it is the only request in the queue) is
> cancelled or flushed off because the fabric controller is shutting down, the
> blk-mq destroy code path cannot make forward progress until ->q_usage_counter
> is decremented and reaches zero, and blk_mq_dispatch_plug_list() only
> decrements ->q_usage_counter after blk_mq_run_hw_queue() returns.
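>
> A stripped-down sketch of that ordering (illustrative only, not the literal
> block/blk-mq.c code):
>
>     /* blk_mq_dispatch_plug_list(), simplified */
>     percpu_ref_get(&this_hctx->queue->q_usage_counter);
>     blk_mq_run_hw_queue(this_hctx, from_sched);           /* dispatch */
>     percpu_ref_put(&this_hctx->queue->q_usage_counter);   /* dropped only after
>                                                             * dispatch returns */
>
> So the dispatcher runs with a reference on ->q_usage_counter held, and the
> destroy path cannot make forward progress underneath it.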
Sounds like an issue that is not related to keep-alive at all.
>>> Call Trace:
>>> autoremove_wake_function+0x0/0xbc (unreliable)
>>> __blk_mq_sched_dispatch_requests+0x114/0x24c
>>> blk_mq_sched_dispatch_requests+0x44/0x84
>>> blk_mq_run_hw_queue+0x140/0x220
>>> nvme_keep_alive_work+0xc8/0x19c [nvme_core]
>>> process_one_work+0x200/0x4e0
>>> worker_thread+0x340/0x504
>>> kthread+0x138/0x140
>>> start_kernel_thread+0x14/0x18
>> What is the "crash" here? use-after-free?
> Yes, it's a use-after-free kernel crash.
>>> While shutting down the fabric controller, if an nvme keep-alive request
>>> sneaks in, it would be flushed off. The nvme_keep_alive_end_io function is
>>> then invoked to handle the end of the keep-alive operation; it decrements
>>> admin->q_usage_counter and, assuming this is the last/only request in the
>>> admin queue, admin->q_usage_counter drops to zero. If that happens, the
>>> blk-mq destroy queue operation (blk_mq_destroy_queue()), which could be
>>> running simultaneously on another cpu (as this is the controller shutdown
>>> code path), makes forward progress and deletes the admin queue. From this
>>> point onward we are not supposed to access the admin queue resources.
>>> However, the nvme keep-alive thread running the hw/hctx queue dispatch
>>> operation hasn't yet finished its work, so it could still access admin
>>> queue resources after the admin queue has already been deleted, and that
>>> causes the above crash.
>> Again, it is unclear to me what justifies the cure here. Every command can
>> race with the shutdown, and it can be sync or async.
>>
> While researching this issue, I found that nvme keep-alive is probably the
> only async request that runs in worker thread context and is not well
> synchronized with the fabric controller shutdown code path. Yes, we may have
> other async requests (for instance async read/write) that get in the way of
> the fabric controller shutdown code path, but those mostly run in user
> context and hence use a different code path than the keep-alive command for
> submitting/queuing the request.
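>
> For reference, the current (pre-patch) keep-alive submission path looks
> roughly like this (heavily simplified sketch, not the exact
> drivers/nvme/host/core.c code):
>
>     static void nvme_keep_alive_work(struct work_struct *work)
>     {
>             struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
>                                                   struct nvme_ctrl, ka_work);
>             struct request *rq;
>
>             /* reserved tag on the admin queue */
>             rq = blk_mq_alloc_request(ctrl->admin_q, nvme_req_op(&ctrl->ka_cmd),
>                                       BLK_MQ_REQ_RESERVED | BLK_MQ_REQ_NOWAIT);
>             if (IS_ERR(rq))
>                     return;
>             nvme_init_request(rq, &ctrl->ka_cmd);
>
>             rq->end_io = nvme_keep_alive_end_io;
>             rq->end_io_data = ctrl;
>             blk_execute_rq_nowait(rq, false);  /* the request can complete and
>                                                 * drop ->q_usage_counter while
>                                                 * the hw/hctx dispatch is still
>                                                 * running in this thread      */
>     }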
This sounds like we papered around the problem. Is it not allowed to call
blk_execute_rq_nowait() from a kthread?
>
>>> This fix helps avoid the observed crash by implementing keep-alive as a
>>> synchronous operation, so that we decrement admin->q_usage_counter only
>>> after the keep-alive command has finished execution and returned its
>>> status to the caller (blk_execute_rq()). This ensures that the fabric
>>> shutdown code path doesn't destroy the fabric admin queue until the
>>> keep-alive request has finished execution and the keep-alive thread is no
>>> longer running the hw/hctx queue dispatch operation.
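>>>
>>> In other words, the submission roughly changes like this (simplified
>>> sketch, not the literal diff):
>>>
>>>     /* in nvme_keep_alive_work(), instead of blk_execute_rq_nowait(): */
>>>     blk_status_t status;
>>>
>>>     status = blk_execute_rq(rq, false);  /* blocks until the command
>>>                                           * completes or is flushed off  */
>>>     blk_mq_free_request(rq);             /* ->q_usage_counter is dropped
>>>                                           * only here, after the dispatch
>>>                                           * has finished                 */
>>>     /* handle 'status' here, as nvme_keep_alive_end_io() used to do */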
>> I'd want to see exactly what is causing this race, because we already flush
>> the keep-alive work when starting the shutdown... so does it mean it is
>> incorrectly firing again?
>>
>> What I think we should do instead, given that we already cancel_sync the
>> keep-alive work before making progress with the teardown, is just make sure
>> that it isn't triggered again afterwards if it does.
> No, keep-alive isn't being incorrectly fired; however, once a keep-alive
> request has been allocated but before it's queued,
The queue teardown quiesces the admin queue first, which should prevent the
dispatch. You probably mean that it runs a little later, after the dispatch
passes the quiesced test...

The question is, why didn't blk_mq_quiesce_queue() wait for an RCU grace
period? That should have ensured that there is no dispatch going on; at least
that is what it's supposed to do...

Something doesn't add up.
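
For reference, the interlock I have in mind, heavily simplified (not the
literal blk-mq code):

    /* teardown side */
    blk_mq_quiesce_queue(q);        /* marks the queue quiesced, then waits
                                     * for an RCU/SRCU grace period        */

    /* dispatch side */
    rcu_read_lock();
    if (!blk_queue_quiesced(q))
            blk_mq_sched_dispatch_requests(hctx);
    rcu_read_unlock();

If the teardown quiesced the admin queue before destroying it, any dispatch
that had already passed the quiesced test should have been covered by that
grace period.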