lockdep WARNING at blktests block/011
Shinichiro Kawasaki
shinichiro.kawasaki at wdc.com
Wed Oct 5 19:30:52 PDT 2022
On Oct 05, 2022 / 08:20, Keith Busch wrote:
> On Wed, Oct 05, 2022 at 07:00:30PM +0900, Tetsuo Handa wrote:
> > On 2022/10/05 17:31, Shinichiro Kawasaki wrote:
> > > @@ -5120,11 +5120,27 @@ EXPORT_SYMBOL_GPL(nvme_start_admin_queue);
> > >  void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
> > >  {
> > >          struct nvme_ns *ns;
> > > +        LIST_HEAD(splice);
> > >  
> > > -        down_read(&ctrl->namespaces_rwsem);
> > > -        list_for_each_entry(ns, &ctrl->namespaces, list)
> > > +        /*
> > > +         * blk_sync_queue() calls in the ctrl->namespaces_rwsem critical section
> > > +         * trigger a deadlock warning by lockdep since cancel_work_sync() in
> > > +         * blk_sync_queue() waits for nvme_timeout() work completion which may
> > > +         * lock the ctrl->namespaces_rwsem. To avoid the deadlock possibility,
> > > +         * call blk_sync_queue() out of the critical section by moving the
> > > +         * ctrl->namespaces list elements to the stack list head temporarily.
> > > +         */
> > > +
> > > +        down_write(&ctrl->namespaces_rwsem);
> > > +        list_splice_init(&ctrl->namespaces, &splice);
> > > +        up_write(&ctrl->namespaces_rwsem);
> >
> > Does this work?
> >
> > ctrl->namespaces being empty when calling blk_sync_queue() means that
> > e.g. nvme_start_freeze() cannot find namespaces to freeze, doesn't it?
>
> There can't be anything to timeout at this point. The controller is disabled
> prior to syncing the queues. Not only is there no IO for timeout work to
> operate on, the controller state is already disabled, so a subsequent freeze
> would be skipped.
Thank you. So, this temporary list move approach should be ok.
>
> > blk_mq_timeout_work(work) { // Is blocking __flush_work() from cancel_work_sync().
> >   blk_mq_queue_tag_busy_iter(blk_mq_check_expired) {
> >     bt_for_each(blk_mq_check_expired) == blk_mq_check_expired() {
> >       blk_mq_rq_timed_out() {
> >         req->q->mq_ops->timeout(req) == nvme_timeout(req) {
> >           nvme_dev_disable() {
> >             mutex_lock(&dev->shutdown_lock); // Holds dev->shutdown_lock
> >             nvme_start_freeze(&dev->ctrl) {
> >               down_read(&ctrl->namespaces_rwsem); // Holds ctrl->namespaces_rwsem, which might block
> >               //blk_freeze_queue_start(ns->queue); // <= Never called because ctrl->namespaces is empty.
> >               up_read(&ctrl->namespaces_rwsem);
> >             }
> >             mutex_unlock(&dev->shutdown_lock);
> >           }
> >         }
> >       }
> >     }
> >   }
> > }
> >
> > Are you sure that down_read(&ctrl->namespaces_rwsem) users won't run
> > when ctrl->namespaces is temporarily made empty? (And if you are sure
> > that down_read(&ctrl->namespaces_rwsem) users won't run when
> > ctrl->namespaces is temporarily made empty, why does ctrl->namespaces_rwsem
> > need to be a rw-sem rather than a plain mutex or spinlock?)
>
> We iterate the list in some target fast paths, so we don't want this to be an
> exclusive section for readers.
>
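Understood. For context, readers such as nvme_find_get_ns() walk ctrl->namespaces
under down_read() on every lookup in the target passthru path, so keeping the
read side shared matters there; a plain mutex would serialize those readers.
A simplified sketch of that reader pattern (not the exact upstream body):

    struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
    {
            struct nvme_ns *ns, *ret = NULL;

            down_read(&ctrl->namespaces_rwsem);     /* shared: concurrent readers allowed */
            list_for_each_entry(ns, &ctrl->namespaces, list) {
                    if (ns->head->ns_id == nsid) {
                            if (nvme_get_ns(ns))    /* take a reference before dropping the lock */
                                    ret = ns;
                            break;
                    }
            }
            up_read(&ctrl->namespaces_rwsem);
            return ret;
    }
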
> > > +        list_for_each_entry(ns, &splice, list)
> > >                  blk_sync_queue(ns->queue);
> > > -        up_read(&ctrl->namespaces_rwsem);
> > > +
> > > +        down_write(&ctrl->namespaces_rwsem);
> > > +        list_splice(&splice, &ctrl->namespaces);
> > > +        up_write(&ctrl->namespaces_rwsem);
> > >  }
> > >  EXPORT_SYMBOL_GPL(nvme_sync_io_queues);
> >
> > I don't know about the dependency chain, but you might be able to add a
> > "struct nvme_ctrl"->sync_io_queue_mutex which is held to serialize
> > nvme_sync_io_queues() and the down_write(&ctrl->namespaces_rwsem) users?
> >
> > If we can guarantee that ctrl->namespaces_rwsem => ctrl->sync_io_queue_mutex
> > is impossible, nvme_sync_io_queues() can use ctrl->sync_io_queue_mutex
> > rather than ctrl->namespaces_rwsem, and down_write(&ctrl->namespaces_rwsem)/
> > up_write(&ctrl->namespaces_rwsem) users are replaced with
> > mutex_lock(&ctrl->sync_io_queue_mutex);
> > down_write(&ctrl->namespaces_rwsem);
> > and
> > up_write(&ctrl->namespaces_rwsem);
> > mutex_unlock(&ctrl->sync_io_queue_mutex);
> > sequences respectively.
Thanks for the idea (educational for me :). I read through the code and did not
find the dependency "ctrl->namespaces_rwsem => ctrl->sync_io_queue_mutex". I also
did a quick implementation and observed that the lockdep splat disappears, so
your idea looks workable. This approach needs a little additional memory, though,
so I'm not sure it is the best way. The list-move fix still looks good enough to
me. I would like to wait for some more comments, or I can post a formal patch for
wider review.
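
For reference, the quick implementation I tried looks roughly like the sketch
below; the field name sync_io_queue_mutex, its init in nvme_init_ctrl(), and the
writer-side wrapping are only illustrative, not a formal patch:

    /*
     * struct nvme_ctrl grows one mutex (mutex_init() in nvme_init_ctrl())
     * which serializes nvme_sync_io_queues() against every
     * down_write(&ctrl->namespaces_rwsem) user, while readers keep using
     * the shared rwsem as before.
     */
    void nvme_sync_io_queues(struct nvme_ctrl *ctrl)
    {
            struct nvme_ns *ns;

            /*
             * Writers cannot add or remove namespaces while this mutex is
             * held, so the list is stable here without taking
             * namespaces_rwsem. That keeps blk_sync_queue() ->
             * cancel_work_sync() out of any namespaces_rwsem critical
             * section, which is what lockdep complained about.
             */
            mutex_lock(&ctrl->sync_io_queue_mutex);
            list_for_each_entry(ns, &ctrl->namespaces, list)
                    blk_sync_queue(ns->queue);
            mutex_unlock(&ctrl->sync_io_queue_mutex);
    }

    /* ...and each down_write(&ctrl->namespaces_rwsem) user is wrapped as: */
    mutex_lock(&ctrl->sync_io_queue_mutex);
    down_write(&ctrl->namespaces_rwsem);
    /* ... list_add_tail()/list_del_init() on ctrl->namespaces ... */
    up_write(&ctrl->namespaces_rwsem);
    mutex_unlock(&ctrl->sync_io_queue_mutex);

The extra memory is that one mutex per controller, and the lock order is always
sync_io_queue_mutex -> namespaces_rwsem, so no reverse dependency is introduced.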
--
Shin'ichiro Kawasaki