[PATCH v5 00/14] nvmet-fcloop: track resources via reference counting

Tue May 6 09:36:14 PDT 2025

On Wed, Apr 23, 2025 at 03:21:43PM +0200, Daniel Wagner wrote:
> Note blktests nvme/030 test is likely to fail if the
> 70-nvmf-autoconnect.rules is active. In this case two discovery are
> running in parallel and nvme-cli/libnvme gets out of sync. I don't see a
> problem in blktests, but maybe I am just blind:
> 
> nvme/030 (tr=fc) (ensure the discovery generation counter is updated appropriately) [failed]
>     runtime  1.843s  ...  1.719s
>     --- tests/nvme/030.out      2023-08-30 08:39:08.428409596 +0000
>     +++ /tmp/blktests/nodev_tr_fc/nvme/030.out.bad      2025-04-10 10:56:05.146372112 +0000
>     @@ -1,2 +1,6 @@
>      Running nvme/030
>     +Failed to open ctrl nvme1, errno 11
>     +Failed to open ctrl nvme1, errno 11
>     +failed to get discovery log: Bad file descriptor

It turns out that nvme/030 uncovered a bunch of bugs. At first the kernel
returned EAGAIN consistently for a while and I could easily reproduce it,
but after updating something it went away. I think the EAGAIN was issued
because in my test setup the udev rule is active and triggers a discovery
(creates a discovery ctrl) which runs in parallel with the test, which is
also running a discovery. I think EAGAIN was always there but it is hard
to hit.

I've added a workaround to handle EINTR to libnvme, but after reading up
on signals I came to the conclusion that nvme-cli needs to handle EAGAIN
and EINTR. The EINTR case might be entered with Ctrl-C, and in this case
we want to terminate the loop. Installing a signal handler in a library
is a no-go from my understanding:

  https://github.com/linux-nvme/nvme-cli/pull/2797
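
To illustrate the idea, here is a minimal sketch of an application-side
retry loop. It is not the code from the pull request above. The names
open_retry(), on_sigint() and install_sigint_handler(), the retry limit
and the 10 ms backoff are all made up for this example. The handler is
installed by the application, without SA_RESTART, so that Ctrl-C can
interrupt a blocking call and end the loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Set by the SIGINT handler; owned by the application, not a library. */
static volatile sig_atomic_t stop_requested;

void on_sigint(int sig)
{
	(void)sig;
	stop_requested = 1;
}

/* Hypothetical helper: install the handler without SA_RESTART so that
 * interrupted system calls return EINTR instead of being restarted. */
int install_sigint_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigint;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGINT, &sa, NULL);
}

/* Hypothetical helper: retry open() on EAGAIN and on EINTR, but give up
 * as soon as the user asked to stop. Returns an fd, or -1 with errno set. */
int open_retry(const char *path, int flags, int max_tries)
{
	const struct timespec delay = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
	int err = EAGAIN;

	for (int i = 0; i < max_tries && !stop_requested; i++) {
		int fd = open(path, flags);

		if (fd >= 0)
			return fd;
		err = errno;
		if (err == EINTR)
			continue;	/* loop condition checks stop_requested */
		if (err != EAGAIN)
			break;		/* real error, do not retry */
		nanosleep(&delay, NULL);	/* assumed backoff before retrying */
	}
	errno = stop_requested ? EINTR : err;
	return -1;
}
```

A caller would run install_sigint_handler() once at startup and then use
open_retry() wherever the controller device is opened. Keeping the flag
and the handler in the tool leaves the library free of process-wide
signal state.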

After getting this out of the way, I figured out that nvmet-fc is not
able to handle more than one in-flight async, and there is a nested
locking issue in nvme-fc.

The tests get more and more reliable, though I thought I saw a KASAN
report but now it's not reproducing. Yeah, everyone loves heisenbugs.