[PATCH v4 2/3] nvme: make keep-alive synchronous operation

Nilay Shroff nilay at linux.ibm.com
Mon Oct 21 04:58:15 PDT 2024



On 10/21/24 03:19, Sagi Grimberg wrote:
> Hey Nilay,
> 
> First, can you please phrase the patch title to describe what it
> is addressing, not what it is changing, that can come (as it does)
> in the body.
> 
Sure, how about naming it "nvme-fabrics: fix kernel crash while shutting down controller"?

> Second, I don't think we want to spend a thread waiting for
> synchronous IO for the purpose of keep alive. Is this the only
> way that this can be done?
> 

> Third, was this observed with non-loop transports?
> 
Yes, it could happen even with non-loop fabric transports.

> 
> On 16/10/2024 6:03, Nilay Shroff wrote:
>> The nvme keep-alive operation, which executes at a periodic interval,
>> could potentially sneak in while shutting down a fabric controller.
>> This may lead to a race between the fabric controller admin queue
>> destroy code path (invoked while shutting down controller) and hw/hctx
>> queue dispatcher called from the nvme keep-alive async request queuing
>> operation. This race could lead to the kernel crash shown below:
> 
> What is different about keep-alive than any other command?
One difference is, of course, that keep-alive is async, although we may have
other async requests running while shutting down the controller. Comparing
keep-alive with other commands, I found that keep-alive runs in worker thread
context, whereas other commands (for instance async read/write) run in user
context and use blk plugging. With plugging, the call path for issuing a
request to the driver is quite different from that of issuing a request with
blk_execute_rq_nowait(). This subtle difference is probably why issuing
read/write commands while shutting down the fabric controller doesn't cause
the kernel crash.
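For reference, keep-alive is queued from its worker roughly as below
(simplified from my reading of nvme_keep_alive_work() in
drivers/nvme/host/core.c; details may differ slightly between kernel versions):

  rq = blk_mq_alloc_request(ctrl->admin_q, nvme_req_op(&ctrl->ka_cmd),
                            BLK_MQ_REQ_RESERVED | BLK_MQ_REQ_NOWAIT);
  if (IS_ERR(rq)) {
          dev_err(ctrl->device, "keep-alive failed: %ld\n", PTR_ERR(rq));
          nvme_reset_ctrl(ctrl);
          return;
  }
  nvme_init_request(rq, &ctrl->ka_cmd);

  rq->timeout = ctrl->kato * HZ;
  rq->end_io = nvme_keep_alive_end_io;
  rq->end_io_data = ctrl;
  /* async: returns immediately, nvme_keep_alive_end_io() runs on completion */
  blk_execute_rq_nowait(rq, false);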

The call path for issuing a request to the driver via blk plugging:
__blk_flush_plug()
  -> blk_mq_flush_plug_list()
    ->blk_mq_dispatch_plug_list()

blk_mq_dispatch_plug_list() first increments ->q_usage_counter before
dispatching requests, so the queue dispatcher (blk_mq_run_hw_queue()) is
protected against the queue being destroyed by another cpu. That means even if
a request (assuming it is the only request in the queue) is cancelled or
flushed off because the fabric controller is shutting down, the blk-mq destroy
code path cannot make forward progress until ->q_usage_counter is decremented
and reaches zero, and blk_mq_dispatch_plug_list() drops ->q_usage_counter only
after blk_mq_run_hw_queue() returns.
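In other words, the plug flush path pins the queue for the whole dispatch,
roughly (abridged from my reading of blk_mq_dispatch_plug_list() in
block/blk-mq.c):

  percpu_ref_get(&this_hctx->queue->q_usage_counter);
  /* ... insert the collected requests and kick the hw queue ... */
  blk_mq_run_hw_queue(this_hctx, from_sched);
  /* queue teardown cannot complete until this reference is dropped */
  percpu_ref_put(&this_hctx->queue->q_usage_counter);

So even if every request on the plug list gets failed immediately, this extra
reference keeps blk_mq_destroy_queue() waiting until the dispatcher has fully
unwound. As far as I can see, blk_execute_rq_nowait() takes no such extra
reference around the dispatch; it relies only on the per-request reference
taken at allocation time, which is dropped from the completion path.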
> 
>>
>> Call Trace:
>>      autoremove_wake_function+0x0/0xbc (unreliable)
>>      __blk_mq_sched_dispatch_requests+0x114/0x24c
>>      blk_mq_sched_dispatch_requests+0x44/0x84
>>      blk_mq_run_hw_queue+0x140/0x220
>>      nvme_keep_alive_work+0xc8/0x19c [nvme_core]
>>      process_one_work+0x200/0x4e0
>>      worker_thread+0x340/0x504
>>      kthread+0x138/0x140
>>      start_kernel_thread+0x14/0x18
> 
> What is the "crash" here? use-after-free?
Yes, it's a use-after-free kernel crash.
> 
>>
>> While shutting down fabric controller, if nvme keep-alive request sneaks
>> in then it would be flushed off. The nvme_keep_alive_end_io function is
>> then invoked to handle the end of the keep-alive operation which
>> decrements the admin->q_usage_counter and assuming this is the last/only
>> request in the admin queue then the admin->q_usage_counter becomes zero.
>> If that happens then blk-mq destroy queue operation (blk_mq_destroy_
>> queue()) which could be potentially running simultaneously on another
>> cpu (as this is the controller shutdown code path) would forward
>> progress and deletes the admin queue. So, now from this point onward
>> we are not supposed to access the admin queue resources. However the
>> issue here's that the nvme keep-alive thread running hw/hctx queue
>> dispatch operation hasn't yet finished its work and so it could still
>> potentially access the admin queue resource while the admin queue had
>> been already deleted and that causes the above crash.
> 
> Again, it is unclear to me what justifies the cure here. Every command can
> race with the shutdown, and it can be sync or async.
> 
While researching this issue, I found that nvme keep-alive is probably the
only async request that runs in worker thread context and is not well
synchronized with the fabric controller shutdown code path. We may have other
async requests (for instance async read/write) which could get in the way of
the fabric controller shutdown code path, but those mostly run in user context
and hence take a different code path than the keep-alive command for
submitting/queuing the request.

>>
>> This fix helps avoid the observed crash by implementing keep-alive as a
>> synchronous operation so that we decrement admin->q_usage_counter only
>> after keep-alive command finished its execution and returns the command
>> status back up to its caller (blk_execute_rq()). This would ensure that
>> fabric shutdown code path doesn't destroy the fabric admin queue until
>> keep-alive request finished execution and also keep-alive thread is not
>> running hw/hctx queue dispatch operation.
> 
> I'd want to see exactly what is causing this race because we already flush
> the keep alive work when starting the shutdown... so it only means it is
> incorrectly firing again?
> 
> What I think we should do instead, is given that we already cancel_sync the keep
> alive work before making progress with the teardown, we should just make sure
> that it doesn't triggered afterwards again if it does.

No, keep-alive doesn't fire incorrectly. However, if a keep-alive request has
been allocated but not yet queued, and the shutdown code path gets in its way
and marks the queue as dying (i.e. setting QUEUE_FLAG_DYING on the request
queue, as well as clearing NVME_TCP_Q_LIVE for the tcp transport or
NVME_LOOP_Q_LIVE for the loop transport), then this issue manifests. Let me
now briefly show the place in the queue dispatcher code where we see this crash.

block layer code path (cpu 0):
blk_execute_rq_nowait()
  -> blk_mq_run_hw_queue()
    -> blk_mq_sched_dispatch_requests()
      -> __blk_mq_sched_dispatch_requests()
        -> blk_mq_do_dispatch_ctx()
          -> blk_mq_dispatch_rq_list()
            -> q->mq_ops->queue_rq()

As we see in the above trace, the function called to submit the request to the
driver is ->queue_rq(). Assuming the async keep-alive request is queued just
after the request queue has been marked as dying, the driver would not process
the request but would instead fail it:

loop driver call path for the failed request (cpu 0):
nvme_loop_queue_rq()
  -> nvme_fail_nonready_command()
    -> nvme_host_path_error()
      -> nvme_complete_rq() 
        -> nvme_end_req()
          -> blk_mq_end_request()
            -> __blk_mq_end_request()
              -> nvme_keep_alive_end_io() 
                -> blk_mq_free_request() <<< here we decrement the ->q_usage_counter
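The counter drop marked above comes from blk_mq_free_request() releasing the
queue reference that was taken when the request was allocated (as I read it):

blk_mq_free_request()
  -> __blk_mq_free_request()
    -> blk_queue_exit()
      -> percpu_ref_put(&q->q_usage_counter) <<< may allow a pending queue freeze to complete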

Assuming keep-alive is the only command in the queue, nvme_keep_alive_end_io()
decrements ->q_usage_counter in the above code path, and ->q_usage_counter now
reaches zero. At this point the shutdown code path, which could be running on
another cpu and is waiting for ->q_usage_counter to become zero, makes forward
progress and deletes the queue. Please see below:

fabric controller shutdown code path (cpu 1):
nvme_remove_admin_tag_set()
  -> blk_mq_destroy_queue() 
    -> blk_mq_freeze_queue_wait() <<< here we wait for ->q_usage_counter to become zero

Once ->q_usage_counter becomes zero, the above code makes forward progress:
blk_mq_destroy_queue() next invokes blk_put_queue(), which deletes the
admin queue.
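For completeness, the wait on cpu 1 is nothing more than (from
block/blk-mq.c, modulo minor version differences):

  void blk_mq_freeze_queue_wait(struct request_queue *q)
  {
          wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
  }

so nothing on the shutdown side accounts for a dispatcher that has already
dropped its last reference but is still running on another cpu.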

Referring back to the loop driver call path running on cpu 0: it now returns
to the block layer. The block layer hasn't yet finished its dispatcher work
and unwinds back into blk_mq_dispatch_rq_list(). blk_mq_dispatch_rq_list(), as
well as its caller blk_mq_do_dispatch_ctx(), could still access @hctx while
running the dispatcher code; however, @hctx has already been deleted on the
other cpu (under the controller shutdown code path described above), and that
causes the kernel crash.
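That is why the patch makes the submission synchronous: if the keep-alive
worker sits in blk_execute_rq() until the command completes, the request (and
with it the queue reference) is not released until the hw/hctx dispatch on
cpu 0 has fully unwound, so the shutdown path on cpu 1 cannot delete the admin
queue underneath it. A minimal sketch of the idea (naming the former end_io
handling nvme_keep_alive_finish() here purely for illustration):

  rq->timeout = ctrl->kato * HZ;
  /*
   * synchronous: does not return until the request has completed and the
   * dispatch code has unwound, so ->q_usage_counter stays elevated meanwhile
   */
  blk_status_t status = blk_execute_rq(rq, false);
  nvme_keep_alive_finish(rq, status, ctrl);  /* what nvme_keep_alive_end_io() did */
  blk_mq_free_request(rq);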

I hope the above description helps with understanding the root cause, but
please let me know if you still have any further questions.

Thanks,
--Nilay


