bad unlock balance WARNING at nvme/045

Sagi Grimberg sagi at grimberg.me
Tue Oct 18 03:57:41 PDT 2022


> Hello Hannes,
> 
> I observed "WARNING: bad unlock balance detected!" at nvme/045 [1]. As the Call
> Trace shows, nvme_auth_reset() has unbalanced mutex lock/unlock.
> 
> 	mutex_lock(&ctrl->dhchap_auth_mutex);
> 	list_for_each_entry(chap, &ctrl->dhchap_auth_list, entry) {
> 		mutex_unlock(&ctrl->dhchap_auth_mutex);
> 		flush_work(&chap->auth_work);
> 		__nvme_auth_reset(chap);
> 	}
> 	mutex_unlock(&ctrl->dhchap_auth_mutex);
> 
> I tried to remove the mutex_unlock in the list iteration with a patch [2], but
> it resulted in another "WARNING: possible recursive locking detected" [3]. I'm
> not sure, but the cause of this WARN could be that __nvme_auth_work and
> nvme_dhchap_auth_work share the same nvme_wq.
> 
> Could you take a look for a fix?

I'm looking at the code, and I think the way the concurrent negotiations and
the dhchap_auth_mutex are handled is very fragile. Also, why should the
per-queue auth_work hold the controller-wide dhchap_auth_mutex? The only
reason I can see is that nvme_auth_negotiate checks whether the chap context
is already queued. Why should we allow that at all?
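
To be clear, the check I'm referring to is roughly this (paraphrased from
memory, names approximate, not the literal upstream code):

	/* in nvme_auth_negotiate(), roughly: */
	mutex_lock(&ctrl->dhchap_auth_mutex);
	list_for_each_entry(chap, &ctrl->dhchap_auth_list, entry) {
		if (chap->qid == qid) {
			/* context already queued: reuse it instead of refusing */
			mutex_unlock(&ctrl->dhchap_auth_mutex);
			flush_work(&chap->auth_work);
			__nvme_auth_reset(chap);
			queue_work(nvme_wq, &chap->auth_work);
			return 0;
		}
	}
	/* otherwise allocate a new context, add it to the list and queue it */
	...
	mutex_unlock(&ctrl->dhchap_auth_mutex);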

I'd suggest splicing dhchap_auth_list onto a local list and then just
flushing nvme_wq in the teardown flows, and the same for the
renegotiation/reset flows. And we should prevent the double-queuing of chap
negotiations to begin with, instead of handling it (I still don't understand
why this is permitted, but perhaps just return -EBUSY in that case?).
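
Something along these lines (completely untested, just to illustrate the
idea; nvme_auth_free_chap() is a stand-in for whatever actually frees a chap
context, and chap->qid for the queue id stored in it):

	/* teardown: detach the contexts, then flush and free them */
	struct nvme_dhchap_queue_context *chap, *tmp;
	LIST_HEAD(auth_list);

	mutex_lock(&ctrl->dhchap_auth_mutex);
	list_splice_init(&ctrl->dhchap_auth_list, &auth_list);
	mutex_unlock(&ctrl->dhchap_auth_mutex);

	flush_workqueue(nvme_wq);
	list_for_each_entry_safe(chap, tmp, &auth_list, entry) {
		list_del_init(&chap->entry);
		nvme_auth_free_chap(chap);	/* stand-in for the real free helper */
	}

	/* negotiate: reject a second negotiation for an already-queued qid */
	mutex_lock(&ctrl->dhchap_auth_mutex);
	list_for_each_entry(chap, &ctrl->dhchap_auth_list, entry) {
		if (chap->qid == qid) {
			mutex_unlock(&ctrl->dhchap_auth_mutex);
			return -EBUSY;
		}
	}
	/* ... allocate, add and queue as before ... */
	mutex_unlock(&ctrl->dhchap_auth_mutex);

That way nothing ever iterates the list while dropping the lock, and the
flush happens without the mutex held.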


