[PATCH] nvme: add nvme pci timeout testcase

Shinichiro Kawasaki shinichiro.kawasaki at wdc.com
Wed Jan 10 20:41:57 PST 2024


On Jan 11, 2024 / 00:00, Chaitanya Kulkarni wrote:
> On 1/10/2024 4:23 AM, Shinichiro Kawasaki wrote:
> > Thanks again for this test case. Please find my review comments. Tomorrow, I
> > will try to run this test case.
> > 
> > On Jan 09, 2024 / 19:57, Chaitanya Kulkarni wrote:
> >> Trigger and test nvme-pci timeout with concurrent fio jobs.
> > 
> > Is there any related kernel commit which motivated this test case? If so,
> > can we add a kernel commit or e-mail discussion link as a reference?
> > 
> 
> no, this part if never tested on the regular basis as it has some
> complicated logic I believe it is much needed test ..

Okay, it's good to expand the test coverage.

...

> >> +}
> >> +
> >> +test() {
> >> +	local dev="/dev/nvme0n1"
> >> +
> >> +	echo "Running ${TEST_NAME}"
> >> +
> >> +	echo 1 > /sys/block/nvme0n1/io-timeout-fail
> >> +
> >> +	echo 99 > /sys/kernel/debug/fail_io_timeout/probability
> >> +	echo 10 > /sys/kernel/debug/fail_io_timeout/interval
> >> +	echo -1 > /sys/kernel/debug/fail_io_timeout/times
> >> +	echo  0 > /sys/kernel/debug/fail_io_timeout/space
> >> +	echo  1 > /sys/kernel/debug/fail_io_timeout/verbose
> > 
> > To avoid impact on following test cases, I suggest to restore the original
> > values of the fail_io_timeout/* sysfs attributes above at the end of this
> > test case. _nvme_err_inject_setup() and _nvme_err_inject_cleanup() in
> > test/nvme/rc do similar thing. We can add new helper functions in same manner.
> 
> I can add the code for config and restore, but at this point there are
> no other testcases using this ?(correct me if I'm wrong),

This new test case is the only one user of fail_io_timeout, I believe.

> so instead of
> bloating the code in nvme/rc let's keep it for this testcase only ?
> and add common helpers code later once we have more users for this
> case ?

I think either way is good.

> 
> > 
> >> +
> >> +	# shellcheck disable=SC2046
> >> +	fio --bs=4k --rw=randread --norandommap --numjobs=$(nproc) \
> > 
> > Double quotes for "$(nproc)" will allow to remove the shellcheck exception,
> > probably.
> 
> not sure I understand this comment, can you please elaborate ?

I suggest this:

diff --git a/tests/nvme/050 b/tests/nvme/050
index ba54bcd..8e5acb2 100755
--- a/tests/nvme/050
+++ b/tests/nvme/050
@@ -28,8 +28,7 @@ test() {
 	echo  0 > /sys/kernel/debug/fail_io_timeout/space
 	echo  1 > /sys/kernel/debug/fail_io_timeout/verbose
 
-	# shellcheck disable=SC2046
-	fio --bs=4k --rw=randread --norandommap --numjobs=$(nproc) \
+	fio --bs=4k --rw=randread --norandommap --numjobs="$(nproc)" \
 	    --name=reads --direct=1 --filename=${dev} --group_reporting \
 	    --time_based --runtime=1m 2>&1 | grep -q "Input/output error"
 

> 
> > 
> >> +	    --name=reads --direct=1 --filename=${dev} --group_reporting \
> >> +	    --time_based --runtime=1m 2>&1 | grep -q "Input/output error"
> >> +
> >> +	# shellcheck disable=SC2181
> >> +	if [ $? -eq 0 ]; then
> >> +		echo "Test complete"
> >> +	else
> >> +		echo "Test failed"
> >> +	fi
> >> +	modprobe -r nvme
> > 
> > If nvme driver is built-in, this unload command does not work.
> > Do we really need to unload nvme driver here?
> > 
> 
> Yes, post timeout handler execution we need to make sure that device
> can be removed at the time of module unload, perhaps there is a way to
> avoid this when nvme is a built-in module so that test will not execute
> above ? any suggestions ?

I see... One idea I can think of is to remove the device using PCI device
sysfs. How about the code below? block/011 does similar thing. To do that
for TEST_DEV, we can use _get_pci_from_dev_sysfs() in place of
_get_pci_from_dev_sysfs(). Does this meet your intent?

diff --git a/tests/nvme/050 b/tests/nvme/050
index 8e5acb2..592f167 100755
--- a/tests/nvme/050
+++ b/tests/nvme/050
@@ -17,6 +17,9 @@ requires() {
 
 test() {
 	local dev="/dev/nvme0n1"
+	local pdev
+
+	pdev="$(_get_pci_from_dev_sysfs /dev/nvme0n1)"
 
 	echo "Running ${TEST_NAME}"
 
@@ -38,5 +41,11 @@ test() {
 	else
 		echo "Test failed"
 	fi
-	modprobe -r nvme
+
+	# Remove and rescan the NVME device to ensure that it has come back
+	echo 1 > "/sys/bus/pci/devices/$pdev/remove"
+	echo 1 > /sys/bus/pci/rescan
+	if [[ ! -b $TEST_DEV ]]; then
+		echo "Failed to regain $TEST_DEV"
+	fi
 }


BTW, when I ran the test case, I observed the test case failed. I took a quick
look, and saw the nvme driver reported I/O timeout in kernel messages. But fio
did not report the expected I/O error. Not sure why. I will look closer when I
check v2 patch.


More information about the Linux-nvme mailing list