blktests failures with v7.1-rc1 kernel
Nilay Shroff
nilay at linux.ibm.com
Thu May 28 22:52:55 PDT 2026
On 5/28/26 10:54 AM, Shin'ichiro Kawasaki wrote:
> On May 25, 2026 / 18:14, Nilay Shroff wrote:
>> hi Shinichiro,
>>
>> On 4/28/26 2:43 PM, Shin'ichiro Kawasaki wrote:
> [...]
>>> #1: nvme/005,063 (tcp transport)
>>>
>>> The test cases nvme/005 and 063 fail for tcp transport due to the lockdep
>>> WARN related to the three locks q->q_usage_counter, q->elevator_lock and
>>> set->srcu. The failure was first reported for nvme/063 on the v6.16-rc1
>>> kernel [2].
>>>
>>> Chaitanya provided a fix patch (thanks!), and it is queued for v7.1-rcX tags
>>> [3]. However, nvme/005 and 063 still fail even when I apply the fix patch to
>>> v7.1-rc1 kernel. The call traces of the lockdep WARN are different between
>>> "v7.1-rc1" kernel [4] and "v7.1-rc1+the fix patch" kernel [5]. I guess that
>>> there are two lockdep problems with similar symptoms, and that patch [3]
>>> fixed only one of them, leaving the other unresolved.
>>>
>>> [2] https://lore.kernel.org/linux-block/4fdm37so3o4xricdgfosgmohn63aa7wj3ua4e5vpihoamwg3ui@fq42f5q5t5ic/
>>> [3] https://lore.kernel.org/all/20260413171628.6204-1-kch@nvidia.com/
>>
>>
>> I looked into this lockdep warning, and it seems that Chaitanya's patch indeed fixes the
>> original issue reported in [4]. However, the new warning reported in [5] appears to be a
>> separate lockdep splat and, from what I can tell, likely a false positive. There are two
>> reasons why I think so:
>>
>> 1. The lockdep report suggests that thread #1 is sending data over a TCP socket while
>> another thread #2 is still in the process of establishing that same socket connection.
>> In practice, this should not be possible because request dispatch over the socket can
>> only happen after the connection setup has completed successfully.
>>
>> 2. The warning also suggests that while thread #0 is deleting the gendisk and unregistering
>> the corresponding request queue, another thread #5 is concurrently attempting to change
>> the queue elevator. However, once gendisk deletion starts, elevator switching is already
>> inhibited for that queue (see disable_elv_switch()), so the reported locking scenario
>> should not be reachable in practice.
>>
>> Based on the above, I suspect this is a lockdep false positive caused by dependency tracking
>> across different queue/socket lifecycle phases. We may need to suppress lock dependency tracking
>> in some of these paths to avoid the false warning.
>
> Hi Nilay, thank you very much for looking into this. It is good to know that
> Chaitanya's patch fixed one problem, and that the other problem looks like a
> false positive.
>
> To confirm the theory of a "lockdep false positive caused by dependency
> tracking across different queue/socket lifecycle phases", I created the
> attached patch. It uses dynamic lockdep keys for the sockets of nvme-tcp
> controllers. With this patch, the WARN at nvme/005 disappears! I think this
> indicates that your suspicion is correct. I will do some more testing and
> post the patch.

Thanks for working on the patch! I reviewed it and the changes look good to me.
I agree that assigning a unique lockdep key to each nvmf-tcp socket is the right
solution.
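
For readers following the thread, here is a rough userspace toy model of why
a per-socket key can make this kind of warning go away. It is not the patch
discussed above and does not use any kernel API. It only mimics the idea that
lockdep records "acquired B while holding A" edges between lock classes
(keys), not between individual lock instances. All names in it
(acquire_while_holding, run_scenario, the integer key values) are
hypothetical, and the two-step lock ordering is an assumed simplification of
the "different lifecycle phases" case, not the actual call traces from [5].

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CLASSES 8

/* dep[a][b] is true once class b was acquired while class a was held. */
static bool dep[MAX_CLASSES][MAX_CLASSES];

/* Depth-first search: can class `to` be reached from class `from`? */
static bool reachable(int from, int to, bool *seen)
{
    if (from == to)
        return true;
    seen[from] = true;
    for (int next = 0; next < MAX_CLASSES; next++)
        if (dep[from][next] && !seen[next] && reachable(next, to, seen))
            return true;
    return false;
}

/*
 * Record the edge "held -> next". Returns true if `held` was already
 * reachable from `next`, i.e. the new edge closes a cycle, which is the
 * condition this toy model treats as a lockdep-style warning.
 */
static bool acquire_while_holding(int held, int next)
{
    bool seen[MAX_CLASSES] = { false };

    assert(held >= 0 && held < MAX_CLASSES);
    assert(next >= 0 && next < MAX_CLASSES);

    bool cycle = reachable(next, held, seen);
    dep[held][next] = true;
    return cycle;
}

/*
 * Assumed scenario: socket A is in its setup phase and takes some other
 * lock while holding its socket lock; socket B is in its I/O phase and
 * takes its socket lock while holding that other lock. The two sockets
 * are different objects, so no real deadlock is possible here.
 */
static bool run_scenario(int key_sock_a, int key_sock_b, int key_other)
{
    bool warned = false;

    memset(dep, 0, sizeof(dep));
    warned |= acquire_while_holding(key_sock_a, key_other); /* setup phase */
    warned |= acquire_while_holding(key_other, key_sock_b); /* I/O phase */
    return warned;
}
```

In this model, run_scenario(1, 1, 0) (both sockets share key 1) returns true,
because the two edges 1 -> 0 and 0 -> 1 form a cycle even though they come
from different sockets. run_scenario(2, 3, 0) (one key per socket) returns
false, since the edges 2 -> 0 and 0 -> 3 never meet.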

Thanks,
--Nilay