race between nvme device creation and discovery?

Daniel Wagner dwagner at suse.de
Fri Feb 2 07:16:33 PST 2024


I am trying to figure out why some of the blktests fail randomly when
running with FC as transport. These failures only appear when the
autoconnect is running in the background, a clear indication that we
still have some sort of interference with it.

nvme/030 fails a bit more often than the rest, and it might just be
because it issues several 'nvme discover' commands, whereas many other
tests issue only one.

When a test fails, it fails with

  failed to lookup subsystem for controller nvme0

which is from libnvme when it iterates over sysfs to gather information.

        subsysname = nvme_ctrl_lookup_subsystem_name(r, name);
        if (!subsysname) {
                nvme_msg(r, LOG_ERR,
                         "failed to lookup subsystem for controller %s\n",
                         name);
                errno = ENXIO;
                return NULL;
        }

My current theory is that adding a new controller is not atomic from
the POV of userland, and thus libnvme is able to observe a situation
where there is a controller but the matching subsystem is not yet visible.

So something like:

  nvme_init_ctrl
    cdev_device_add

  // libnvme iterates over sysfs

  nvme_init_ctrl_finish
    nvme_init_identify
      nvme_init_subsystem
         device_add          // nvme-subsys%d
         sysfs_create_link   // subsys->dev -> ctrl-device

Does this make any sense? And if so, what could be done? Should we add
some retry logic to libnvme?
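
For discussion, a minimal sketch of what such retry logic could look
like. Everything here is an assumption for illustration: lookup_fn,
lookup_with_retry() and fake_lookup() are hypothetical names, not
libnvme API, and the retry count and delay are arbitrary. The lookup is
passed in as a callback so the sketch does not depend on libnvme
internals.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for nanosleep() */
#endif
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: returns the subsystem name, or NULL if the
 * subsystem for ctrl_name is not (yet) visible. */
typedef const char *(*lookup_fn)(const char *ctrl_name, void *ctx);

/*
 * Call lookup up to max_tries times, sleeping delay_ms between
 * attempts. Returns the subsystem name, or NULL with errno = ENXIO
 * if every attempt failed.
 */
static const char *lookup_with_retry(lookup_fn lookup, const char *ctrl_name,
                                     void *ctx, int max_tries, long delay_ms)
{
        struct timespec ts = {
                .tv_sec = delay_ms / 1000,
                .tv_nsec = (delay_ms % 1000) * 1000000L,
        };
        int i;

        for (i = 0; i < max_tries; i++) {
                const char *subsysname = lookup(ctrl_name, ctx);

                if (subsysname)
                        return subsysname;
                if (i + 1 < max_tries)
                        nanosleep(&ts, NULL);   /* give the kernel time to finish */
        }

        errno = ENXIO;
        return NULL;
}

/* Stand-in for the sysfs scan: fails until the counter in ctx hits zero,
 * mimicking a subsystem that becomes visible a little later. */
static const char *fake_lookup(const char *ctrl_name, void *ctx)
{
        int *remaining = ctx;

        (void)ctrl_name;
        if (*remaining > 0) {
                (*remaining)--;
                return NULL;
        }
        return "nvme-subsys0";
}
```

A bounded retry like this would only paper over the window in
userspace. It cannot tell a controller that is still being set up from
one whose subsystem will never appear, so the final ENXIO is still
needed.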

Daniel


