[PATCH 2/2] nvme: make keep-alive synchronous operation

Nilay Shroff nilay at linux.ibm.com
Mon Oct 7 00:55:56 PDT 2024



On 10/7/24 12:11, Christoph Hellwig wrote:
> On Fri, Oct 04, 2024 at 05:16:57PM +0530, Nilay Shroff wrote:
>> The nvme keep-alive operation, which executes at a periodic interval,
>> could potentially sneak in while shutting down a fabric controller.
>> This may lead to a race between the fabric controller admin queue
>> destroy code path (while shutting down the controller) and the blk-mq
>> hw/hctx queuing from the keep-alive thread.
>>
>> This fix helps avoid the race by implementing keep-alive as a synchronous
>> operation, so that the admin queue-usage ref counter is decremented only
>> after the keep-alive command finishes execution and returns its status.
> 
> With that you mean ->q_usage_counter?
Yes, I meant ->q_usage_counter.
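
For reference, the synchronous variant roughly looks like the sketch below
(simplified from nvme_keep_alive_work() in drivers/nvme/host/core.c, with the
traffic-based keep-alive handling elided and nvme_keep_alive_finish() turned
into a plain helper that receives the returned status; the exact plumbing in
the posted patch may differ slightly):

static void nvme_keep_alive_work(struct work_struct *work)
{
        struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
                        struct nvme_ctrl, ka_work);
        struct request *rq;
        blk_status_t status;

        rq = blk_mq_alloc_request(ctrl->admin_q, nvme_req_op(&ctrl->ka_cmd),
                                  BLK_MQ_REQ_RESERVED | BLK_MQ_REQ_NOWAIT);
        if (IS_ERR(rq)) {
                /* allocation failure, reset the controller */
                dev_err(ctrl->device, "keep-alive failed: %ld\n", PTR_ERR(rq));
                nvme_reset_ctrl(ctrl);
                return;
        }
        nvme_init_request(rq, &ctrl->ka_cmd);
        rq->timeout = ctrl->kato * HZ;

        /*
         * Block until the command completes: the request, and with it the
         * admin queue's q_usage_counter reference, is only released after
         * the keep-alive status has been processed.
         */
        status = blk_execute_rq(rq, false);
        nvme_keep_alive_finish(rq, status, ctrl);
        blk_mq_free_request(rq);
}

With this, the keep-alive work item cannot return while the command is still
in flight, so the admin queue cannot be torn down underneath the dispatch path.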
> 
> Moving to synchronous submission and wasting a workqueue context for
> that is a bit sad.  I think just removing the blk_mq_free_request call
> from nvme_keep_alive_finish and returning RQ_END_IO_FREE instead
> should have the same effect, or am I missing something?
> 
Unfortunately, that would not help, because we still fall through the same
code path; we would only be moving the blk_mq_free_request() call from
nvme_keep_alive_finish() to its caller. For instance, assuming we keep the
nvme keep-alive work asynchronous, the code path for invoking
nvme_keep_alive_finish() would be:

nvme_keep_alive_work()
  ->blk_execute_rq_no_wait()
    ->blk_mq_run_hw_queue()
      ->blk_mq_sched_dispatch_requests()
        ->__blk_mq_sched_dispatch_requests()
          ->blk_mq_dispatch_rq_list()
            ->nvme_loop_queue_rq()
              ->nvme_fail_nonready_command() 
                ->nvme_complete_rq()
                  ->nvme_end_req()
                    ->blk_mq_end_request()
                      ->__blk_mq_end_request()
                        ->nvme_keep_alive_finish() 

In the above call path, if nvme_keep_alive_finish() returns RQ_END_IO_FREE
(and we remove the blk_mq_free_request() call from it), the caller
__blk_mq_end_request() would invoke blk_mq_free_request() instead, and we may
still hit the same bug: freeing the request at that point allows
blk_mq_destroy_queue() (which may be running on another CPU in the controller
shutdown code path) to make forward progress, and the subsequent
blk_put_queue() then deletes the admin queue resources.
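
To make that concrete, the suggested change would essentially be the following
(a hypothetical sketch, keeping the asynchronous submission); the request is
still freed from within the same dispatch call chain, just one frame higher,
in __blk_mq_end_request():

static enum rq_end_io_ret nvme_keep_alive_finish(struct request *rq,
                                                 blk_status_t status)
{
        struct nvme_ctrl *ctrl = rq->end_io_data;

        /* ... process status and decide whether to re-arm keep-alive ... */

        /*
         * Instead of calling blk_mq_free_request(rq) here, let
         * __blk_mq_end_request() free the request for us.
         */
        return RQ_END_IO_FREE;
}

Either way, the q_usage_counter reference held by the keep-alive request is
dropped while blk_mq_dispatch_rq_list()/blk_mq_run_hw_queue() are still
executing, which is exactly what lets the shutdown path race ahead.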

>> Also, while we are at it, instead of first acquiring ctrl lock and then
>> accessing NVMe controller state, let's use the helper function
>> nvme_ctrl_state() in nvme_keep_alive_end_io() and get rid of the
>> lock.
> 
> Please split that into a separate patch.
Sure, I will split the patch and resend the series.
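
For the record, that cleanup is essentially the following (a sketch; the exact
context inside the keep-alive completion handler may differ):

        /* before: take ctrl->lock just to read the controller state */
        spin_lock_irqsave(&ctrl->lock, flags);
        if (ctrl->state == NVME_CTRL_LIVE ||
            ctrl->state == NVME_CTRL_CONNECTING)
                startka = true;
        spin_unlock_irqrestore(&ctrl->lock, flags);

        /* after: nvme_ctrl_state() is a lockless READ_ONCE() of ctrl->state */
        state = nvme_ctrl_state(ctrl);
        if (state == NVME_CTRL_LIVE || state == NVME_CTRL_CONNECTING)
                startka = true;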

And thank you for your review comments!

--Nilay


