blktests failures with v5.19-rc1

Wed Jun 15 15:01:39 PDT 2022

On 6/15/22 12:47, Bjorn Helgaas wrote:
> On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote:
>> On Jun 14, 2022 / 02:38, Chaitanya Kulkarni wrote:
>>> Shinichiro,
>>>
>>> On 6/13/22 19:23, Keith Busch wrote:
>>>> On Tue, Jun 14, 2022 at 01:09:07AM +0000, Shinichiro Kawasaki wrote:
>>>>> (CC+: linux-pci)
>>>>> On Jun 11, 2022 / 16:34, Yi Zhang wrote:
>>>>>> On Fri, Jun 10, 2022 at 10:49 PM Keith Busch <kbusch at kernel.org> wrote:
>>>>>>>
>>>>>>> And I am not even sure this is real. I don't know yet why
>>>>>>> this is showing up only now, but this should fix it:
>>>>>>
>>>>>> Hi Keith
>>>>>>
>>>>>> Confirmed the WARNING issue was fixed with the change, here is
>>>>>> the log:
>>>>>
>>>>> Thanks. I also confirmed that Keith's change to add
>>>>> __ATTR_IGNORE_LOCKDEP to dev_attr_dev_rescan avoids the WARN, on
>>>>> v5.19-rc2.
>>>>>
>>>>> I took a closer look into this issue and found that the deadlock
>>>>> WARN can be recreated with the following two commands:
>>>>>
>>>>> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/rescan
>>>>> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/remove
>>>>>
>>>>> It can also be recreated with PCI devices other than an NVMe
>>>>> controller, such as a SCSI controller or a VGA controller, so
>>>>> this is not a storage sub-system issue.
>>>>>
>>>>> I checked the function call stacks of the two commands above. As
>>>>> shown below, it looks like a possible ABBA deadlock is
>>>>> detected and warned about.
>>>>
>>>> Yeah, I was mistaken on this report, so my proposal to suppress
>>>> the warning is definitely not right. If I run both 'echo'
>>>> commands in parallel, I see it deadlock frequently. I'm not
>>>> familiar enough with this code to have any good ideas on how to
>>>> fix it, but I agree this is a generic PCI issue.
>>>
>>> I think it is worth adding a testcase to blktests to make sure
>>> future releases are tested for this.
>>
>> Yeah, this WARN is confusing for us, so it would be valuable to
>> test it with blktests so that it is not repeated. One point I wonder
>> about: which test group should the test case fall in? The nvme group
>> is probably the one to add it to.
>>

Since this issue was discovered with nvme rescan and remove,
it should be added to the nvme category.
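
A minimal sketch of what such a test could exercise, assuming bash and
root access. The function name race_rescan_remove and the SYSFS_PCI
override are hypothetical (they are not part of blktests); the sysfs
attributes are the ones from the two commands quoted above. It only
issues the two writes in parallel and does not detect the deadlock
itself:

```shell
#!/bin/bash
# Hypothetical reproducer sketch, not an existing blktests test case.
# Writes 1 to the "rescan" and "remove" attributes of one PCI device
# in parallel, mirroring the two echo commands from the report.

race_rescan_remove() {
	local bdf=$1
	# SYSFS_PCI is an assumed override so the function can be
	# pointed at a scratch directory instead of the real sysfs.
	local root=${SYSFS_PCI:-/sys/bus/pci/devices}
	local dev="$root/$bdf"

	if [ ! -d "$dev" ]; then
		echo "no such PCI device: $bdf" >&2
		return 1
	fi

	# Start both writes in the background so they can race.
	echo 1 > "$dev/rescan" &
	echo 1 > "$dev/remove" &

	# Wait for both writers. On an affected kernel this may never
	# return if the two paths deadlock.
	wait
	return 0
}
```

For example, run "race_rescan_remove 0000:00:09.0" as root and then
check dmesg for the lockdep WARN. Note that this removes the device;
writing 1 to /sys/bus/pci/rescan should bring it back.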

>> Another point I wonder about is kernel test suites other than
>> blktests. Don't we have a more appropriate test suite to check the
>> PCI device rescan/remove race? Such a test sounds more like a PCI
>> bus sub-system test than a block/storage test.

I don't think so; otherwise we would have caught this a long time ago,
but we clearly did not.

> 
> I'm not aware of such a test, but it would be nice to have one.
> 
> Can you share your qemu config so I can reproduce this locally?
> 
> Thanks for finding and reporting this!
> 
> Bjorn

-ck