blktests failures with v6.11-rc1 kernel
Yi Zhang
yi.zhang at redhat.com
Mon Aug 19 18:50:47 PDT 2024
On Mon, Aug 19, 2024 at 9:35 PM Nilay Shroff <nilay at linux.ibm.com> wrote:
>
>
>
> On 8/19/24 18:04, Shinichiro Kawasaki wrote:
> > On Aug 14, 2024 / 18:05, Nilay Shroff wrote:
> >>
> >>
> >> On 8/13/24 12:36, Yi Zhang wrote:
> >>> On Sat, Aug 3, 2024 at 12:49 AM Nilay Shroff <nilay at linux.ibm.com> wrote:
> >>>
> >>> There are no other tests running simultaneously during the CKI runs.
> >>> I reproduced the failure on that server; it is always reproducible
> >>> within 5 iterations:
> >>> # sh a.sh
> >>> ==============================0
> >>> nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [passed]
> >>> runtime 21.496s ... 21.398s
> >>> ==============================1
> >>> nvme/052 (tr=loop) (Test file-ns creation/deletion under one subsystem) [failed]
> >>> runtime 21.398s ... 21.974s
> >>> --- tests/nvme/052.out 2024-08-10 00:30:06.989814226 -0400
> >>> +++ /root/blktests/results/nodev_tr_loop/nvme/052.out.bad
> >>> 2024-08-13 02:53:51.635047928 -0400
> >>> @@ -1,2 +1,5 @@
> >>> Running nvme/052
> >>> +cat: /sys/block/nvme1n2/uuid: No such file or directory
> >>> +cat: /sys/block/nvme1n2/uuid: No such file or directory
> >>> +cat: /sys/block/nvme1n2/uuid: No such file or directory
> >>> Test complete
> >>> # uname -r
> >>> 6.11.0-rc3
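> >>>
> >>> (For reference, a.sh is just a loop that reruns the test case and prints a
> >>> counter. The real script was not posted; a hypothetical sketch, assuming the
> >>> usual blktests invocation with nvme_trtype=loop, would be:
> >>>
> >>> #!/bin/bash
> >>> # hypothetical repro loop; the real a.sh may differ
> >>> for i in $(seq 0 9); do
> >>>     echo "==============================$i"
> >>>     nvme_trtype=loop ./check nvme/052 || break
> >>> done
> >>> )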
> >>
> >> We may need to debug this further. Is it possible to patch blktests and
> >> collect some details when this issue manifests? If yes, can you please
> >> apply the diff below and re-run your test? This patch captures the output
> >> of "nvme list" and the sysfs attribute tree created under the namespace
> >> head node, and stores those details in the 052.full file.
> >>
> >> diff --git a/common/nvme b/common/nvme
> >> index 9e78f3e..780b5e3 100644
> >> --- a/common/nvme
> >> +++ b/common/nvme
> >> @@ -589,8 +589,23 @@ _find_nvme_ns() {
> >>  		if ! [[ "${ns}" =~ nvme[0-9]+n[0-9]+ ]]; then
> >>  			continue
> >>  		fi
> >> +		echo -e "\nBefore ${ns}/uuid check:\n" >> ${FULL}
> >> +		echo -e "\n`nvme list -v`\n" >> ${FULL}
> >> +		echo -e "\n`tree ${ns}`\n" >> ${FULL}
> >> +
> >>  		[ -e "${ns}/uuid" ] || continue
> >>  		uuid=$(cat "${ns}/uuid")
> >> +
> >> +		if [ "$?" = "1" ]; then
> >> +			echo -e "\nFailed to read $ns/uuid\n" >> ${FULL}
> >> +			echo "`nvme list -v`" >> ${FULL}
> >> +			if [ -d "${ns}" ]; then
> >> +				echo -e "\n`tree ${ns}`\n" >> ${FULL}
> >> +			else
> >> +				echo -e "\n${ns} doesn't exist!\n" >> ${FULL}
> >> +			fi
> >> +		fi
> >> +
> >>  		if [[ "${subsys_uuid}" == "${uuid}" ]]; then
> >>  			basename "${ns}"
> >>  		fi
> >>
> >>
> >> After applying the above diff, when this issue occurs on your system copy this
> >> file "</path/to/blktests>/results/nodev_tr_loop/nvme/052.full" and send it across.
> >> This may give us some clue about what might be going wrong.
> >
> > Nilay, thank you for this suggestion. To follow it, I tried to recreate the
> > failure again, and managed to do it :) When I repeat the test case 20 or 40
> > times on one of my test machines, the failure is observed in a stable manner.
>
> Shinichiro, I am glad that you were able to recreate this issue.
>
> > I applied your debug patch above to blktests, then repeated the test case.
> > Unfortunately, the failure disappeared. Even when I repeated the test case 100
> > times, the failure was not observed. I guess the debug echos changed the timing
> > of the sysfs uuid file accesses, which made the failure disappear.
>
> Yes, this could be possible. BTW, Yi tried the same patch and, with the patch
> applied, the issue could still be reproduced on Yi's testbed!!
> > This helped me think about the cause. The test case repeats _create_nvmet_ns
> > and _remove_nvmet_ns, which in turn repeatedly creates and removes the sysfs
> > uuid file. I guess that when _remove_nvmet_ns echos 0 to ${nvmet_ns_path}/enable
> > to remove the namespace, it does not wait for the removal work to complete.
> > Then, when _find_nvme_ns() checks for the existence of the sysfs uuid file, it
> > sees the stale uuid file that the previous _remove_nvmet_ns left behind. When it
> > cats that sysfs uuid file, the cat fails because the file has been removed in
> > the meantime, before the next _create_nvmet_ns recreates it.
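> >
> > To illustrate the suspected window (the paths here are examples taken from
> > the failure log above, not a literal trace), the sequence would be roughly:
> >
> >   # target side: _remove_nvmet_ns disables the namespace; this write
> >   # returns before the host-side teardown has completed
> >   echo 0 > /sys/kernel/config/nvmet/subsystems/$nqn/namespaces/1/enable
> >
> >   # host side: _find_nvme_ns() can then race with the teardown
> >   [ -e /sys/block/nvme1n2/uuid ]   # stale uuid file still visible
> >   cat /sys/block/nvme1n2/uuid      # fails: file removed in between
> >
> > which would produce exactly the "No such file or directory" errors seen in
> > 052.out.bad.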
>
> I agree with your assessment of the plausible cause of this issue. I just reviewed
> the nvme target kernel code, and it's now apparent to me that we need to wait for
> the removal of the namespace before we create the next one. I think this is an
> oversight in the test.
> >
> > Based on this guess, I created the patch below. It modifies the test case to wait
> > for the namespace device to disappear after calling _remove_nvmet_ns. (I assume
> > that the sysfs uuid file disappears when the device file disappears.) With this
> > patch, the failure was not observed when repeating the test 100 times. I also
> > reverted the kernel commit ff0ffe5b7c3c ("nvme: fix namespace removal list")
> > from v6.11-rc4, and confirmed that the test case with this change can still
> > detect that regression.
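> >
> > (For that last check the workflow was presumably along these lines; the exact
> > commands are an assumption, not a literal transcript:
> >
> >   # on a v6.11-rc4 tree
> >   git revert ff0ffe5b7c3c    # "nvme: fix namespace removal list"
> >   # rebuild and boot the kernel, then rerun the modified nvme/052;
> >   # it should now fail, i.e. the test still detects the regression
> > )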
> >
> I am pretty sure that your patch would solve this issue.
>
> > I will do some more confirmation. If it goes well, will post this change as
> > a formal patch.
> >
> > diff --git a/tests/nvme/052 b/tests/nvme/052
> > index cf6061a..469cefd 100755
> > --- a/tests/nvme/052
> > +++ b/tests/nvme/052
> > @@ -39,15 +39,32 @@ nvmf_wait_for_ns() {
> >  		ns=$(_find_nvme_ns "${uuid}")
> >  	done
> >
> > +	echo "$ns"
> >  	return 0
> >  }
> >
> > +nvmf_wait_for_ns_removal() {
> > +	local ns=$1 i
> > +
> > +	for ((i = 0; i < 10; i++)); do
> > +		if [[ ! -e /dev/$ns ]]; then
> > +			return
> > +		fi
> > +		sleep .1
> > +		echo "wait removal of $ns" >> "$FULL"
> > +	done
> > +
> > +	if [[ -e /dev/$ns ]]; then
> > +		echo "Failed to remove the namespace $ns"
> > +	fi
> > +}
> > +
> >  test() {
> >  	echo "Running ${TEST_NAME}"
> >
> >  	_setup_nvmet
> >
> > -	local iterations=20
> > +	local iterations=20 ns
> >
> >  	_nvmet_target_setup
> >
> > @@ -63,7 +80,7 @@ test() {
> >  		_create_nvmet_ns "${def_subsysnqn}" "${i}" "$(_nvme_def_file_path).$i" "${uuid}"
> >
> >  		# wait until async request is processed and ns is created
> > -		nvmf_wait_for_ns "${uuid}"
> > +		ns=$(nvmf_wait_for_ns "${uuid}")
> >  		if [ $? -eq 1 ]; then
> >  			echo "FAIL"
> >  			rm "$(_nvme_def_file_path).$i"
> > @@ -71,6 +88,7 @@ test() {
> >  		fi
> >
> >  		_remove_nvmet_ns "${def_subsysnqn}" "${i}"
> > +		nvmf_wait_for_ns_removal "$ns"
> >  		rm "$(_nvme_def_file_path).$i"
> >  	}
> >  	done
>
> I think there's a formatting issue in the above patch; I see some stray characters
> which you may clean up when you send the formal patch.
>
> Yi, I think you may also try the above patch on your testbed and confirm the result.
Nilay/Shinichiro,
Confirmed: the failure can no longer be reproduced with this patch.
>
> Thanks,
> --Nilay
>
--
Best Regards,
Yi Zhang