[PATCH v5 00/14] nvmet-fcloop: track resources via reference counting
Daniel Wagner
dwagner at suse.de
Tue May 6 09:36:14 PDT 2025
On Wed, Apr 23, 2025 at 03:21:43PM +0200, Daniel Wagner wrote:
> Note blktests nvme/030 test is likely to fail if the
> 70-nvmf-autoconnect.rules is active. In this case two discovery are
> running in parallel and nvme-cli/libnvme gets out of sync. I don't see a
> problem in blktests, but maybe I am just blind:
>
> nvme/030 (tr=fc) (ensure the discovery generation counter is updated appropriately) [failed]
> runtime 1.843s ... 1.719s
> --- tests/nvme/030.out 2023-08-30 08:39:08.428409596 +0000
> +++ /tmp/blktests/nodev_tr_fc/nvme/030.out.bad 2025-04-10 10:56:05.146372112 +0000
> @@ -1,2 +1,6 @@
> Running nvme/030
> +Failed to open ctrl nvme1, errno 11
> +Failed to open ctrl nvme1, errno 11
> +failed to get discovery log: Bad file descriptor
It turns out that nvme/030 uncovered a bunch of bugs. At first the kernel
returned EAGAIN consistently for a while and I could easily reproduce it,
but after updating something it went away. I think the EAGAIN was issued
because in my test setup the udev rule is active and triggers a discovery
(creating a discovery ctrl) which runs in parallel with the test, which is
also running a discovery. I think the EAGAIN was always there but it is
hard to hit.
I've added a workaround to libnvme to handle EINTR, but after reading up
on signals I came to the conclusion that nvme-cli needs to handle both
EAGAIN and EINTR. The EINTR case might be entered with Ctrl-C, and in this
case we want to terminate the loop. Installing a signal handler in a
library is a no-go from my understanding:
https://github.com/linux-nvme/nvme-cli/pull/2797
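To illustrate the split described above, here is a rough sketch of a retry
loop where the application, not the library, owns the signal handler. This
is not the code from the pull request. The names stop_requested,
install_sigint_handler(), retry_op() and flaky_op() are made up for this
example, and the 10 ms delay and retry limit are arbitrary assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <time.h>

/* Set by the application's SIGINT handler; a library would only read it. */
volatile sig_atomic_t stop_requested;

void on_sigint(int sig)
{
	(void)sig;
	stop_requested = 1;
}

/* Called by the application (hypothetical helper), never by the library. */
int install_sigint_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigint;
	sigemptyset(&sa.sa_mask);
	/* No SA_RESTART, so blocking syscalls fail with EINTR on Ctrl-C. */
	return sigaction(SIGINT, &sa, NULL);
}

/* Hypothetical operation: returns >= 0 on success, -1 with errno set. */
typedef int (*op_fn)(void *arg);

/*
 * Call op until it succeeds. EAGAIN and EINTR are retried at most
 * max_retries times, except that EINTR ends the loop immediately once
 * the user has requested a stop. Any other errno is returned at once.
 */
int retry_op(op_fn op, void *arg, int max_retries)
{
	struct timespec delay = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
	int tries = 0;
	int ret;

	for (;;) {
		ret = op(arg);
		if (ret >= 0)
			return ret;
		if (errno == EINTR && stop_requested)
			return -1;	/* Ctrl-C: terminate the loop */
		if (errno != EAGAIN && errno != EINTR)
			return -1;	/* real error */
		if (++tries > max_retries)
			return -1;	/* give up, errno is from the last op() */
		nanosleep(&delay, NULL);
	}
}

/* Demo operation: fails with EAGAIN until *remaining reaches zero. */
int flaky_op(void *arg)
{
	int *remaining = arg;

	if (*remaining > 0) {
		(*remaining)--;
		errno = EAGAIN;
		return -1;
	}
	return 3;	/* pretend this is a file descriptor */
}
```

With a counter of 2, retry_op(flaky_op, &counter, 5) retries twice and then
returns the fake descriptor. With a counter larger than the retry limit it
returns -1 and leaves errno at EAGAIN for the caller to report.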
After getting this out of the way, I figured out that nvmet-fc is not
able to handle more than one in-flight async, and there is a nested
locking issue in nvme-fc.
The tests are getting more and more reliable, though I thought I saw a
KASAN report that is no longer reproducing. Yeah, everyone loves heisenbugs.