lockdep WARNING at blktests block/011

Keith Busch kbusch at kernel.org
Wed Oct 5 07:20:00 PDT 2022


On Wed, Oct 05, 2022 at 07:00:30PM +0900, Tetsuo Handa wrote:
> On 2022/10/05 17:31, Shinichiro Kawasaki wrote:
> > @@ -5120,11 +5120,27 @@ EXPORT_SYMBOL_GPL(nvme_start_admin_queue);
> >  void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
> >  {
> >  	struct nvme_ns *ns;
> > +	LIST_HEAD(splice);
> >  
> > -	down_read(&ctrl->namespaces_rwsem);
> > -	list_for_each_entry(ns, &ctrl->namespaces, list)
> > +	/*
> > +	 * A blk_sync_queue() call in the ctrl->namespaces_rwsem critical section
> > +	 * triggers a deadlock warning by lockdep, since cancel_work_sync() in
> > +	 * blk_sync_queue() waits for nvme_timeout() work completion, which may
> > +	 * take ctrl->namespaces_rwsem. To avoid the deadlock possibility,
> > +	 * call blk_sync_queue() outside the critical section by moving the
> > +	 * ctrl->namespaces list elements to an on-stack list head temporarily.
> > +	 */
> > +
> > +	down_write(&ctrl->namespaces_rwsem);
> > +	list_splice_init(&ctrl->namespaces, &splice);
> > +	up_write(&ctrl->namespaces_rwsem);
> 
> Does this work?
> 
> ctrl->namespaces being empty when calling blk_sync_queue() means that
> e.g. nvme_start_freeze() cannot find namespaces to freeze, doesn't it?

There can't be anything to timeout at this point. The controller is disabled
prior to syncing the queues. Not only is there no IO for the timeout work to
operate on, but the controller state is already disabled, so a subsequent
freeze would be skipped.
 
>   blk_mq_timeout_work(work) { // Is blocking __flush_work() from cancel_work_sync().
>     blk_mq_queue_tag_busy_iter(blk_mq_check_expired) {
>       bt_for_each(blk_mq_check_expired) == blk_mq_check_expired() {
>         blk_mq_rq_timed_out() {
>           req->q->mq_ops->timeout(req) == nvme_timeout(req) {
>             nvme_dev_disable() {
>               mutex_lock(&dev->shutdown_lock); // Holds dev->shutdown_lock
>               nvme_start_freeze(&dev->ctrl) {
>                 down_read(&ctrl->namespaces_rwsem); // Holds ctrl->namespaces_rwsem which might block
>                 //blk_freeze_queue_start(ns->queue); // <= Never be called because ctrl->namespaces is empty.
>                 up_read(&ctrl->namespaces_rwsem);
>               }
>               mutex_unlock(&dev->shutdown_lock);
>             }
>           }
>         }
>       }
>     }
>   }
> 
> Are you sure that down_read(&ctrl->namespaces_rwsem) users won't run
> when ctrl->namespaces is temporarily made empty? (And if you are sure
> that down_read(&ctrl->namespaces_rwsem) users won't run when
> ctrl->namespaces is temporarily made empty, why ctrl->namespaces_rwsem
> needs to be a rw-sem rather than a plain mutex or spinlock ?)

We iterate the list in some target fast paths, so we don't want this to be an
exclusive section for readers.
 
> > +	list_for_each_entry(ns, &splice, list)
> >  		blk_sync_queue(ns->queue);
> > -	up_read(&ctrl->namespaces_rwsem);
> > +
> > +	down_write(&ctrl->namespaces_rwsem);
> > +	list_splice(&splice, &ctrl->namespaces);
> > +	up_write(&ctrl->namespaces_rwsem);
> >  }
> >  EXPORT_SYMBOL_GPL(nvme_sync_io_queues);
> 
> I don't know about dependency chain, but you might be able to add
> "struct nvme_ctrl"->sync_io_queue_mutex which is held for serializing
> nvme_sync_io_queues() and down_write(&ctrl->namespaces_rwsem) users?
> 
> If we can guarantee that ctrl->namespaces_rwsem => ctrl->sync_io_queue_mutex
> is impossible, nvme_sync_io_queues() can use ctrl->sync_io_queue_mutex
> rather than ctrl->namespaces_rwsem, and down_write(&ctrl->namespaces_rwsem)/
> up_write(&ctrl->namespaces_rwsem) users are replaced with
>   mutex_lock(&ctrl->sync_io_queue_mutex);
>   down_write(&ctrl->namespaces_rwsem);
> and
>   up_write(&ctrl->namespaces_rwsem);
>   mutex_unlock(&ctrl->sync_io_queue_mutex);
> sequences respectively.
> 


