blktests failures with v6.11-rc1 kernel
Shinichiro Kawasaki
shinichiro.kawasaki at wdc.com
Mon Aug 19 05:34:51 PDT 2024
On Aug 14, 2024 / 18:05, Nilay Shroff wrote:
>
>
> On 8/13/24 12:36, Yi Zhang wrote:
> > On Sat, Aug 3, 2024 at 12:49 AM Nilay Shroff <nilay at linux.ibm.com> wrote:
> >
> > There are no simultaneous tests running during the CKI tests. I reproduced
> > the failure on that server, and it can always be reproduced within 5 runs:
> > # sh a.sh
> > ==============================0
> > nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [passed]
> > runtime 21.496s ... 21.398s
> > ==============================1
> > nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [failed]
> > runtime 21.398s ... 21.974s
> > --- tests/nvme/052.out 2024-08-10 00:30:06.989814226 -0400
> > +++ /root/blktests/results/nodev_tr_loop/nvme/052.out.bad 2024-08-13 02:53:51.635047928 -0400
> > @@ -1,2 +1,5 @@
> > Running nvme/052
> > +cat: /sys/block/nvme1n2/uuid: No such file or directory
> > +cat: /sys/block/nvme1n2/uuid: No such file or directory
> > +cat: /sys/block/nvme1n2/uuid: No such file or directory
> > Test complete
> > # uname -r
> > 6.11.0-rc3
>
> We may need to debug this further. Is it possible to patch blktests and
> collect some details when this issue manifests? If yes, can you please
> apply the below diff and re-run your test? This patch would capture the
> output of "nvme list" and the sysfs attribute tree created under the
> namespace head node, and store those details in the 052.full file.
>
> diff --git a/common/nvme b/common/nvme
> index 9e78f3e..780b5e3 100644
> --- a/common/nvme
> +++ b/common/nvme
> @@ -589,8 +589,23 @@ _find_nvme_ns() {
> if ! [[ "${ns}" =~ nvme[0-9]+n[0-9]+ ]]; then
> continue
> fi
> + echo -e "\nBefore ${ns}/uuid check:\n" >> ${FULL}
> + echo -e "\n`nvme list -v`\n" >> ${FULL}
> + echo -e "\n`tree ${ns}`\n" >> ${FULL}
> +
> [ -e "${ns}/uuid" ] || continue
> uuid=$(cat "${ns}/uuid")
> +
> + if [ "$?" = "1" ]; then
> + echo -e "\nFailed to read $ns/uuid\n" >> ${FULL}
> + echo "`nvme list -v`" >> ${FULL}
> + if [ -d "${ns}" ]; then
> + echo -e "\n`tree ${ns}`\n" >> ${FULL}
> + else
> + echo -e "\n${ns} doesn't exist!\n" >> ${FULL}
> + fi
> + fi
> +
> if [[ "${subsys_uuid}" == "${uuid}" ]]; then
> basename "${ns}"
> fi
>
>
> After applying the above diff, when this issue occurs on your system copy this
> file "</path/to/blktests>/results/nodev_tr_loop/nvme/052.full" and send it across.
> This may give us some clue about what might be going wrong.
Nilay, thank you for this suggestion. To follow it, I tried to recreate the
failure again, and managed to do it :) When I repeat the test case 20 or 40
times on one of my test machines, the failure is observed in a stable manner.
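(For reference, I drove the repetition with a simple loop along the lines
below. This is just a sketch; the iteration count is arbitrary, and I rely on
blktests picking up nvme_trtype from the environment or the config file.)

  for ((i = 0; i < 20; i++)); do
          nvme_trtype=loop ./check nvme/052 || break
  done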
I applied your debug patch above to blktests, then repeated the test case.
Unfortunately, the failure disappeared: even when I repeated the test case 100
times, the failure was not observed. I guess the debug echos changed the timing
of the sysfs uuid file accesses, and that made the failure disappear.
This helped me think about the cause. The test case repeats _create_nvmet_ns
and _remove_nvmet_ns, so it repeatedly creates and removes the sysfs uuid
file. I guess that when _remove_nvmet_ns echoes 0 to ${nvmet_ns_path}/enable
to remove the namespace, it does not wait for the removal work to complete.
Then, when _find_nvme_ns() checks for the existence of the sysfs uuid file, it
can see the sysfs uuid file that the previous _remove_nvmet_ns left behind.
When it then cats the sysfs uuid file, the cat fails because the file has been
removed in the meantime, before the next _create_nvmet_ns recreates it.
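For context, on the target side the namespace removal boils down to configfs
operations roughly like the ones below (a sketch of what _remove_nvmet_ns in
common/nvme does; the paths are examples for the default blktests subsystem
and namespace 1). Neither step waits for the host side to finish tearing down
the corresponding block device:

  # disable the target namespace; the host removes nvmeXnY asynchronously
  echo 0 > /sys/kernel/config/nvmet/subsystems/blktests-subsystem-1/namespaces/1/enable
  # drop the configfs directory for the namespace
  rmdir /sys/kernel/config/nvmet/subsystems/blktests-subsystem-1/namespaces/1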
Based on this guess, I created the patch below. It modifies the test case to
wait until the namespace device disappears after calling _remove_nvmet_ns. (I
assume that the sysfs uuid file disappears when the device file disappears.)
With this patch, the failure was not observed even when repeating the test
case 100 times. I also reverted the kernel commit ff0ffe5b7c3c ("nvme: fix
namespace removal list") from v6.11-rc4, then confirmed that the test case
with this change can still detect the regression.
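(For the record, the regression check was roughly the following; the build and
install steps of course depend on the local tree and setup.)

  git revert ff0ffe5b7c3c              # "nvme: fix namespace removal list"
  make -j"$(nproc)" && sudo make modules_install install
  # reboot into the reverted kernel, then:
  nvme_trtype=loop ./check nvme/052    # fails again, so the regression is caught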
I will do some more confirmation. If it goes well, I will post this change as
a formal patch.
diff --git a/tests/nvme/052 b/tests/nvme/052
index cf6061a..469cefd 100755
--- a/tests/nvme/052
+++ b/tests/nvme/052
@@ -39,15 +39,32 @@ nvmf_wait_for_ns() {
ns=$(_find_nvme_ns "${uuid}")
done
+ echo "$ns"
return 0
}
+nvmf_wait_for_ns_removal() {
+ local ns=$1 i
+
+ for ((i = 0; i < 10; i++)); do
+ if [[ ! -e /dev/$ns ]]; then
+ return
+ fi
+ sleep .1
+ echo "wait removal of $ns" >> "$FULL"
+ done
+
+ if [[ -e /dev/$ns ]]; then
+ echo "Failed to remove the namespace $ns"
+ fi
+}
+
test() {
echo "Running ${TEST_NAME}"
_setup_nvmet
- local iterations=20
+ local iterations=20 ns
_nvmet_target_setup
@@ -63,7 +80,7 @@ test() {
_create_nvmet_ns "${def_subsysnqn}" "${i}" "$(_nvme_def_file_path).$i" "${uuid}"
# wait until async request is processed and ns is created
- nvmf_wait_for_ns "${uuid}"
+ ns=$(nvmf_wait_for_ns "${uuid}")
if [ $? -eq 1 ]; then
echo "FAIL"
rm "$(_nvme_def_file_path).$i"
@@ -71,6 +88,7 @@ test() {
fi
_remove_nvmet_ns "${def_subsysnqn}" "${i}"
+ nvmf_wait_for_ns_removal "$ns"
rm "$(_nvme_def_file_path).$i"
}
done