[PATCH v4 2/3] nvme: make keep-alive synchronous operation

Sagi Grimberg sagi at grimberg.me
Sun Oct 20 14:49:25 PDT 2024


Hey Nilay,

First, can you please phrase the patch title to describe what it
is addressing, not what it is changing? The latter can come (as it
does) in the body.

Second, I don't think we want to spend a thread waiting for
synchronous IO for the purpose of keep alive. Is this the only
way that this can be done?

Third, was this observed with non-loop transports?


On 16/10/2024 6:03, Nilay Shroff wrote:
> The nvme keep-alive operation, which executes at a periodic interval,
> could potentially sneak in while shutting down a fabric controller.
> This may lead to a race between the fabric controller admin queue
> destroy code path (invoked while shutting down controller) and hw/hctx
> queue dispatcher called from the nvme keep-alive async request queuing
> operation. This race could lead to the kernel crash shown below:

What is different about keep-alive compared to any other command?

>
> Call Trace:
>      autoremove_wake_function+0x0/0xbc (unreliable)
>      __blk_mq_sched_dispatch_requests+0x114/0x24c
>      blk_mq_sched_dispatch_requests+0x44/0x84
>      blk_mq_run_hw_queue+0x140/0x220
>      nvme_keep_alive_work+0xc8/0x19c [nvme_core]
>      process_one_work+0x200/0x4e0
>      worker_thread+0x340/0x504
>      kthread+0x138/0x140
>      start_kernel_thread+0x14/0x18

What is the "crash" here? A use-after-free?

>
> While shutting down the fabric controller, if an nvme keep-alive request
> sneaks in, it is flushed out. The nvme_keep_alive_end_io function is
> then invoked to handle the end of the keep-alive operation; it
> decrements admin->q_usage_counter and, assuming this was the last/only
> request in the admin queue, the admin->q_usage_counter drops to zero.
> If that happens, the blk-mq destroy queue operation
> (blk_mq_destroy_queue()), which could be running simultaneously on
> another cpu (as this is the controller shutdown code path), makes
> forward progress and deletes the admin queue. From that point onward
> we are not supposed to access the admin queue resources. However, the
> issue here is that the nvme keep-alive thread running the hw/hctx queue
> dispatch operation hasn't yet finished its work, so it could still
> access the admin queue resources after the admin queue has already
> been deleted, causing the above crash.

Again, it is unclear to me what justifies the cure here. Every command can
race with the shutdown, and it can be sync or async.

>
> This fix avoids the observed crash by implementing keep-alive as a
> synchronous operation, so that we decrement admin->q_usage_counter only
> after the keep-alive command has finished execution and returned its
> status to the caller (blk_execute_rq()). This ensures that the fabric
> shutdown code path doesn't destroy the fabric admin queue until the
> keep-alive request has finished execution and the keep-alive thread is
> no longer running the hw/hctx queue dispatch operation.

I'd want to see exactly what is causing this race, because we already
flush the keep-alive work when starting the shutdown... so it only means
it is incorrectly firing again?

What I think we should do instead, given that we already cancel_sync the
keep-alive work before making progress with the teardown, is just make
sure that it isn't triggered again afterwards.


