race between nvme device creation and discovery?
Daniel Wagner
dwagner at suse.de
Fri Feb 2 07:16:33 PST 2024
I am trying to figure out why some of the blktests fail randomly when
running with FC as transport. These failures only appear when
autoconnect is running in the background, a clear indication that we
still have some sort of interference with it.
nvme/030 fails a bit more often than the rest, and that might just be
because it issues several 'nvme discover' commands, while many other
tests issue only one.
When a test fails, it fails with
failed to lookup subsystem for controller nvme0
which comes from libnvme when it iterates over sysfs to gather information:
	subsysname = nvme_ctrl_lookup_subsystem_name(r, name);
	if (!subsysname) {
		nvme_msg(r, LOG_ERR,
			 "failed to lookup subsystem for controller %s\n",
			 name);
		errno = ENXIO;
		return NULL;
	}
My current theory is that adding a new controller is not atomic from
the POV of userland, and thus libnvme is able to observe a situation
where there is a controller but the matching subsystem is not yet visible.
So something like:
  nvme_init_ctrl
    cdev_device_add
                                // libnvme iterates over sysfs
  nvme_init_ctrl_finish
    nvme_init_identify
      nvme_init_subsystem
        device_add              // nvme-subsys%d
        sysfs_create_link       // subsys->dev -> ctrl-device
Does this make any sense? And if so, what could be done? Should we add
some retry logic to libnvme?
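
To make the retry idea a bit more concrete, here is a rough sketch of
what it could look like. This is not libnvme code: lookup_with_retry(),
the lookup_fn callback type and the fake_sysfs/fake_lookup stand-ins
are all made up for illustration, and the number of tries and the delay
are arbitrary assumptions. The callback is meant to stand in for
something like nvme_ctrl_lookup_subsystem_name().

```c
#define _POSIX_C_SOURCE 200809L	/* for nanosleep() */

#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical callback type: returns the subsystem name for a
 * controller, or NULL if the subsystem is not (yet) visible.
 */
typedef const char *(*lookup_fn)(const char *ctrl_name, void *ctx);

/*
 * Retry the lookup up to max_tries times, sleeping delay_ms between
 * attempts. Returns the subsystem name, or NULL with errno set to
 * ENXIO if the subsystem never showed up.
 */
static const char *lookup_with_retry(lookup_fn lookup, const char *ctrl_name,
				     void *ctx, int max_tries, long delay_ms)
{
	struct timespec ts;
	const char *subsys;
	int i;

	assert(lookup && max_tries > 0 && delay_ms >= 0);

	for (i = 0; i < max_tries; i++) {
		subsys = lookup(ctrl_name, ctx);
		if (subsys)
			return subsys;

		/* No point in sleeping after the last attempt. */
		if (i + 1 < max_tries) {
			ts.tv_sec = delay_ms / 1000;
			ts.tv_nsec = (delay_ms % 1000) * 1000000L;
			nanosleep(&ts, NULL);
		}
	}

	errno = ENXIO;
	return NULL;
}

/*
 * Stand-in for sysfs, for illustration only: the subsystem becomes
 * "visible" on the visible_after-th lookup.
 */
struct fake_sysfs {
	int calls;
	int visible_after;
};

static const char *fake_lookup(const char *ctrl_name, void *ctx)
{
	struct fake_sysfs *fs = ctx;

	(void)ctrl_name;
	if (++fs->calls >= fs->visible_after)
		return "nvme-subsys0";
	return NULL;
}
```

With a fake_sysfs whose visible_after is 3, a call such as
lookup_with_retry(fake_lookup, "nvme0", &fs, 5, 1) would succeed on the
third attempt. A bounded retry like this only papers over the window
though; it would not make the controller/subsystem creation atomic.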
Daniel