[PATCH] nvme: add nvme pci timeout testcase
Shinichiro Kawasaki
shinichiro.kawasaki at wdc.com
Wed Jan 10 20:41:57 PST 2024
On Jan 11, 2024 / 00:00, Chaitanya Kulkarni wrote:
> On 1/10/2024 4:23 AM, Shinichiro Kawasaki wrote:
> > Thanks again for this test case. Please find my review comments. Tomorrow, I
> > will try to run this test case.
> >
> > On Jan 09, 2024 / 19:57, Chaitanya Kulkarni wrote:
> >> Trigger and test nvme-pci timeout with concurrent fio jobs.
> >
> > Is there any related kernel commit which motivated this test case? If so,
> > can we add a kernel commit or e-mail discussion link as a reference?
> >
>
> no, this part if never tested on the regular basis as it has some
> complicated logic I believe it is much needed test ..
Okay, it's good to expand the test coverage.
...
> >> +}
> >> +
> >> +test() {
> >> + local dev="/dev/nvme0n1"
> >> +
> >> + echo "Running ${TEST_NAME}"
> >> +
> >> + echo 1 > /sys/block/nvme0n1/io-timeout-fail
> >> +
> >> + echo 99 > /sys/kernel/debug/fail_io_timeout/probability
> >> + echo 10 > /sys/kernel/debug/fail_io_timeout/interval
> >> + echo -1 > /sys/kernel/debug/fail_io_timeout/times
> >> + echo 0 > /sys/kernel/debug/fail_io_timeout/space
> >> + echo 1 > /sys/kernel/debug/fail_io_timeout/verbose
> >
> > To avoid impact on following test cases, I suggest to restore the original
> > values of the fail_io_timeout/* sysfs attributes above at the end of this
> > test case. _nvme_err_inject_setup() and _nvme_err_inject_cleanup() in
> > test/nvme/rc do similar thing. We can add new helper functions in same manner.
>
> I can add the code for config and restore, but at this point there are
> no other testcases using this ?(correct me if I'm wrong),
This new test case is the only one user of fail_io_timeout, I believe.
> so instead of
> bloating the code in nvme/rc let's keep it for this testcase only ?
> and add common helpers code later once we have more users for this
> case ?
I think either way is good.
>
> >
> >> +
> >> + # shellcheck disable=SC2046
> >> + fio --bs=4k --rw=randread --norandommap --numjobs=$(nproc) \
> >
> > Double quotes for "$(nproc)" will allow to remove the shellcheck exception,
> > probably.
>
> not sure I understand this comment, can you please elaborate ?
I suggest this:
diff --git a/tests/nvme/050 b/tests/nvme/050
index ba54bcd..8e5acb2 100755
--- a/tests/nvme/050
+++ b/tests/nvme/050
@@ -28,8 +28,7 @@ test() {
echo 0 > /sys/kernel/debug/fail_io_timeout/space
echo 1 > /sys/kernel/debug/fail_io_timeout/verbose
- # shellcheck disable=SC2046
- fio --bs=4k --rw=randread --norandommap --numjobs=$(nproc) \
+ fio --bs=4k --rw=randread --norandommap --numjobs="$(nproc)" \
--name=reads --direct=1 --filename=${dev} --group_reporting \
--time_based --runtime=1m 2>&1 | grep -q "Input/output error"
>
> >
> >> + --name=reads --direct=1 --filename=${dev} --group_reporting \
> >> + --time_based --runtime=1m 2>&1 | grep -q "Input/output error"
> >> +
> >> + # shellcheck disable=SC2181
> >> + if [ $? -eq 0 ]; then
> >> + echo "Test complete"
> >> + else
> >> + echo "Test failed"
> >> + fi
> >> + modprobe -r nvme
> >
> > If nvme driver is built-in, this unload command does not work.
> > Do we really need to unload nvme driver here?
> >
>
> Yes, post timeout handler execution we need to make sure that device
> can be removed at the time of module unload, perhaps there is a way to
> avoid this when nvme is a built-in module so that test will not execute
> above ? any suggestions ?
I see... One idea I can think of is to remove the device using PCI device
sysfs. How about the code below? block/011 does similar thing. To do that
for TEST_DEV, we can use _get_pci_from_dev_sysfs() in place of
_get_pci_from_dev_sysfs(). Does this meet your intent?
diff --git a/tests/nvme/050 b/tests/nvme/050
index 8e5acb2..592f167 100755
--- a/tests/nvme/050
+++ b/tests/nvme/050
@@ -17,6 +17,9 @@ requires() {
test() {
local dev="/dev/nvme0n1"
+ local pdev
+
+ pdev="$(_get_pci_from_dev_sysfs /dev/nvme0n1)"
echo "Running ${TEST_NAME}"
@@ -38,5 +41,11 @@ test() {
else
echo "Test failed"
fi
- modprobe -r nvme
+
+ # Remove and rescan the NVME device to ensure that it has come back
+ echo 1 > "/sys/bus/pci/devices/$pdev/remove"
+ echo 1 > /sys/bus/pci/rescan
+ if [[ ! -b $TEST_DEV ]]; then
+ echo "Failed to regain $TEST_DEV"
+ fi
}
BTW, when I ran the test case, I observed the test case failed. I took a quick
look, and saw the nvme driver reported I/O timeout in kernel messages. But fio
did not report the expected I/O error. Not sure why. I will look closer when I
check v2 patch.
More information about the Linux-nvme
mailing list