[PATCH blktests 0/2] add nvme test for creating sleep while atomic kernel BUG

Wed Dec 4 02:08:18 PST 2024

On 12/3/24 13:56, Chaitanya Kulkarni wrote:
> On 12/2/24 21:38, Nilay Shroff wrote:
>> Hitting the kernel BUG depends on the race. In the test case, we disable the target ns
>> and then write to it and there's a time window between "disabling ns and writing to it".
>> During this time window, after disabling ns but before we actually begin writing to it,
>> if the target could clean up ns and remove it from subsystem Xarray then we may not hit
>> this BUG. So I run the test case in a loop 10 times, hoping that we'd hit it at least
>> once. However, on my test system, I could hit it 2-3 times for each run of the test.
> 
> Thanks for the explanation. However, we need a test that hits the bug 100% of the
> time and doesn't behave differently when users run it multiple times.
> 
> Luis has shared his general experience running block tests, and the conclusion was
> that we need tests that are consistent and don't produce different results when
> executed multiple times. From the feedback we got, we can't really guarantee that
> every user will know this and/or adjust the test case's loop count to hit the bug,
> or run it multiple times, which brings down the effectiveness of the test. It also
> becomes a real problem when building a CI on top of blktests.
> 
> How about we add error injection code that prolongs the race window, so that it
> stops the target from cleaning up the namespace and removing it from the xarray
> when the disable-ns command is executed, and then we write to it? Of course,
> before disabling the ns we would have to enable the corresponding error injection
> code, potentially a sleep.
> 
> This guarantees that we will hit the race window and makes the test effective
> 100% of the time.
> 
> Or is there any other simple solution we can think of?
> 
I understand your concerns here.

So I have devised a way to recreate this issue 100% of the time without requiring the user
to re-run the test multiple times. The idea is that before we disable the target ns and write
to it, we first disable the ns-changed asynchronous event notification (AEN), which the target
sends to the host whenever it detects any change to a ns (including ns addition, removal,
etc.). Then, when we later disable the ns on the target, it doesn't generate an AEN for the
ns removal, and that allows the host to write to a ns that is disabled on the target. With
this change, the test triggers the kernel BUG 100% of the time.
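
To illustrate the idea, here is a rough sketch of one possible way to mask the ns-changed
AEN from the host side. It is not the actual patch. It assumes the notification is turned
off through the Asynchronous Event Configuration feature (feature ID 0x0b), where bit 8 is
the Namespace Attribute Notices bit, and that nvme-cli's set-feature is used to write the
new value. The helper names, the device path and the starting value 0x100 are hypothetical,
and the command is only printed, not executed.

```shell
#!/bin/bash
# Sketch only: build an Asynchronous Event Configuration value (feature 0x0b)
# with the Namespace Attribute Notices bit cleared, so that the target would
# not send the ns-changed AEN. Helper names below are hypothetical.

NS_ATTR_NOTICE_BIT=8	# bit 8 of feature 0x0b: Namespace Attribute Notices

# Print the given AEN configuration value with bit 8 cleared.
aen_cfg_without_ns_notice() {
	local cur=$1

	printf '0x%x\n' $(( cur & ~(1 << NS_ATTR_NOTICE_BIT) ))
}

# Print (do not run) the nvme-cli command that would write the masked value.
# $1: controller device (assumed path), $2: current AEN configuration value.
print_disable_ns_aen_cmd() {
	local dev=$1
	local cur=$2

	echo "nvme set-feature ${dev} -f 0x0b -v $(aen_cfg_without_ns_notice "${cur}")"
}

# Example with an assumed device and an assumed current value of 0x100.
print_disable_ns_aen_cmd /dev/nvme1 0x100
```

In a real test the current value would presumably be read first (for example with
nvme get-feature) and the command executed before disabling the ns on the target, so that
the host never learns about the ns removal.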

I will spin a new patch with the above change and send it for review later today.

Thanks,
--Nilay