[PATCH blktests V2 1/1] nvme: add nvme pci timeout testcase
Chaitanya Kulkarni
chaitanyak at nvidia.com
Sun Jan 21 20:06:13 PST 2024
On 1/19/2024 3:11 AM, Shinichiro Kawasaki wrote:
> Hi Chaitanya, thanks for this v2 patch.
>
> I ran this new test case with several devices and observed these two:
>
> 1) The test case fails with my QEMU NVMe devices. It looks like all of the
> injected failures get recovered by the error handler, so no I/O error is
> reported to the fio process. I modified two fail_io_timeout parameters as
> follows to inject more failures, and saw the test case fail with the device.
> I suggest these parameter values.
>
> @@ -49,8 +50,8 @@ test_device() {
> save_fi_settings
> echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail
>
> - echo 99 > /sys/kernel/debug/fail_io_timeout/probability
> - echo 10 > /sys/kernel/debug/fail_io_timeout/interval
> + echo 100 > /sys/kernel/debug/fail_io_timeout/probability
> + echo 1 > /sys/kernel/debug/fail_io_timeout/interval
> echo -1 > /sys/kernel/debug/fail_io_timeout/times
> echo 0 > /sys/kernel/debug/fail_io_timeout/space
> echo 1 > /sys/kernel/debug/fail_io_timeout/verbose
>
Let me test these and add them to V3. But it is really strange; the current
code passes the testcase 100% of the time without the above changes ...
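For reference, my reading of the fault-injection documentation on why your
values inject more failures (a sketch of the semantics as I understand them,
not verified against the code):

    # consider every timed-out request; interval > 1 would skip all but
    # every Nth request before the probability check is even applied
    echo 1 > /sys/kernel/debug/fail_io_timeout/interval
    # fail 100% of the considered requests instead of 99%
    echo 100 > /sys/kernel/debug/fail_io_timeout/probability
    # no limit on how many failures may be injected
    echo -1 > /sys/kernel/debug/fail_io_timeout/times
    # disable the size-based filter
    echo 0 > /sys/kernel/debug/fail_io_timeout/space
    # log each injected failure to dmesg
    echo 1 > /sys/kernel/debug/fail_io_timeout/verbose

With interval=10 and probability=99 only roughly one timeout in ten gets the
failure injected, which the error handler apparently manages to recover from
on the QEMU device.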
> 2) After I ran the test case, I see the test target device left with zero
> size. I think this is consistent with your report [*]. I still think a device
> remove and rescan at the end of the test case is required to avoid impacting
> the following test cases. It will depend on the discussion of your report,
> though.
>
Yes, still waiting on that; hopefully we will get some information to
move this forward. The reason I didn't add the device remove/rescan is
that leaving the device in place helped me debug its state after the
timeout handler, which turned out to be not consistent for sure ..
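If we do end up adding it, what I have in mind is something along these
lines at the end of test_device() -- a rough, untested sketch that assumes
the _get_pci_dev_from_blkdev helper from tests/nvme/rc and the PCI sysfs
remove/rescan interface:

    # remove the PCI function backing the test device and rescan the bus
    # so the following test cases see a freshly probed controller
    local pdev
    pdev="$(_get_pci_dev_from_blkdev)"
    echo 1 > /sys/bus/pci/devices/"${pdev}"/remove
    echo 1 > /sys/bus/pci/rescan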
> [*] https://lore.kernel.org/linux-nvme/287b9ed9-6eb3-46d0-a6f0-a9d283245b90@nvidia.com/
>
> Please find some more comments inline:
>
> On Jan 16, 2024 / 20:22, Chaitanya Kulkarni wrote:
>> Trigger and test nvme-pci timeout with concurrent fio jobs.
>>
>> Signed-off-by: Chaitanya Kulkarni <kch at nvidia.com>
>> ---
>> tests/nvme/050 | 74 ++++++++++++++++++++++++++++++++++++++++++++++
>> tests/nvme/050.out | 2 ++
>> 2 files changed, 76 insertions(+)
>> create mode 100755 tests/nvme/050
>> create mode 100644 tests/nvme/050.out
>>
>> diff --git a/tests/nvme/050 b/tests/nvme/050
>> new file mode 100755
>> index 0000000..6c44b43
>> --- /dev/null
>> +++ b/tests/nvme/050
>> @@ -0,0 +1,74 @@
>> +#!/bin/bash
>> +# SPDX-License-Identifier: GPL-3.0+
>> +# Copyright (C) 2024 Chaitanya Kulkarni
>> +#
>> +# Test NVMe-PCI timeout with FIO jobs by triggering the nvme_timeout function.
>> +#
>> +
>> +. tests/nvme/rc
>> +
>> +DESCRIPTION="test nvme-pci timeout with fio jobs"
>
> I ran this test case on a ZNS drive today, and it looks to be working.
> I suggest adding:
>
> CAN_BE_ZONED=1
>
okay ..
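With that, the globals at the top of the test would simply become:

    DESCRIPTION="test nvme-pci timeout with fio jobs"
    CAN_BE_ZONED=1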
>> +
>> +sysfs_path="/sys/kernel/debug/fail_io_timeout/"
>> +# restrict the test to nvme-pci only
>> +nvme_trtype=pci
>> +
>> +# fault injection config array
>> +declare -A fi_array
>> +
>> +requires() {
>> + _require_nvme_trtype pci
>
> We can remove this line, since we added "nvme_trtype=pci" above.
yeah ..
>
>> + _have_fio
>> + _nvme_requires
>> + _have_kernel_option FAIL_IO_TIMEOUT
>
> Today, I found that this test case depends on another kernel option. Let's add,
>
> _have_kernel_option FAULT_INJECTION_DEBUG_FS
indeed, good catch; I'll add this ..
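So requires() would then be something like (sketch):

    requires() {
        _nvme_requires
        _have_fio
        _have_kernel_option FAIL_IO_TIMEOUT
        _have_kernel_option FAULT_INJECTION_DEBUG_FS
    }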
>
>> +}
>> +
>> +device_requires() {
>> + _require_test_dev_is_nvme
>> +}
>
> Actually, this check is not required here, since it is already done in
> group_device_requires() in tests/nvme/rc. It is a small piece of leftover
> work to remove the same device_requires() in nvme/032.
>
okay will remove that ..
>> +
>> +save_fi_settings() {
>> + for fi_attr in probability interval times space verbose
>> + do
>> + fi_array["${fi_attr}"]=$(cat "${sysfs_path}/${fi_attr}")
>> + done
>> +}
>> +
>> +restore_fi_settings() {
>> + for fi_attr in probability interval times space verbose
>> + do
>> + echo "${fi_array["${fi_attr}"]}" > "${sysfs_path}/${fi_attr}"
>> + done
>> +}
>> +
>> +test_device() {
>> + local nvme_ns
>> + local io_timeout_fail
>> +
>> + echo "Running ${TEST_NAME}"
>> +
>> + nvme_ns="$(basename "${TEST_DEV}")"
>> + io_timeout_fail="$(cat /sys/block/"${nvme_ns}"/io-timeout-fail)"
>> + save_fi_settings
>> + echo 1 > /sys/block/"${nvme_ns}"/io-timeout-fail
>> +
>> + echo 99 > /sys/kernel/debug/fail_io_timeout/probability
>> + echo 10 > /sys/kernel/debug/fail_io_timeout/interval
>> + echo -1 > /sys/kernel/debug/fail_io_timeout/times
>> + echo 0 > /sys/kernel/debug/fail_io_timeout/space
>> + echo 1 > /sys/kernel/debug/fail_io_timeout/verbose
>> +
>> + fio --bs=4k --rw=randread --norandommap --numjobs="$(nproc)" \
>> + --name=reads --direct=1 --filename="${TEST_DEV}" --group_reporting \
>> + --time_based --runtime=1m 2>&1 | grep -q "Input/output error"
>
> When I investigated the cause of this test case's failure, I found the fio
> output useful. So, I suggest keeping the output in $FULL.
>
> --time_based --runtime=1m >&"$FULL"
>
okay, I will modify the testcase for that ..
>> +
>> + # shellcheck disable=SC2181
>> + if [ $? -eq 0 ]; then
>
> With that $FULL file, the if statement can be as follows. And we don't need to
> disable SC2181 either.
>
> if grep -q "Input/output error" "$FULL"; then
agree ..
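Putting your two suggestions together, that part of test_device() would then
look roughly like this (untested sketch):

    fio --bs=4k --rw=randread --norandommap --numjobs="$(nproc)" \
        --name=reads --direct=1 --filename="${TEST_DEV}" --group_reporting \
        --time_based --runtime=1m >&"$FULL"

    if grep -q "Input/output error" "$FULL"; then
        echo "Test complete"
    else
        echo "Test failed"
    fi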
>
>> + echo "Test complete"
>> + else
>> + echo "Test failed"
>> + fi
>> +
>> + restore_fi_settings
>> + echo "${io_timeout_fail}" > /sys/block/"${nvme_ns}"/io-timeout-fail
>> +}
>> diff --git a/tests/nvme/050.out b/tests/nvme/050.out
>> new file mode 100644
>> index 0000000..b78b05f
>> --- /dev/null
>> +++ b/tests/nvme/050.out
>> @@ -0,0 +1,2 @@
>> +Running nvme/050
>> +Test complete
>> --
>> 2.40.0
Thanks for the review comments; I'll send out a V3 with your
suggestions ..
-ck