[PATCH 00/35] RFC: add "nvme monitor" subcommand
Sagi Grimberg
sagi at grimberg.me
Thu Jan 28 20:14:23 EST 2021
> From: Martin Wilck <mwilck at suse.com>
>
> (Cover letter copied from https://github.com/linux-nvme/nvme-cli/pull/877)
>
> This patch set adds a new subcommand **nvme monitor**. In this mode,
> **nvme-cli** runs continuously, monitors events (currently, uevents) relevant
> for discovery, and optionally autoconnects to newly discovered subsystems.
>
> The monitor mode is well suited to being run as a systemd service; an appropriate
> unit file is provided. As such, **nvme monitor** can be used as an alternative
> to the current auto-connection mechanism based on udev rules and systemd
> template units. If `--autoconnect` is active, **nvme monitor** masks the
> respective udev rules in order to prevent simultaneous connection attempts
> from udev and itself.
I think that a monitor daemon is a good path forward.
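For readers unfamiliar with udev masking, here is a minimal, purely illustrative sketch of the usual runtime mechanism: shadow the rules file with a `/dev/null` symlink under `/run/udev/rules.d`. The rule file name (assumed here to be nvme-cli's `70-nvmf-autoconnect.rules`) and the helper are assumptions, not code from this patch set.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

static int mask_udev_rule(const char *rule /* e.g. "70-nvmf-autoconnect.rules" */)
{
	char path[256];

	snprintf(path, sizeof(path), "/run/udev/rules.d/%s", rule);
	/* a /dev/null symlink in /run/udev/rules.d overrides and effectively
	 * disables a rules file of the same name shipped by the package */
	if (symlink("/dev/null", path) < 0 && errno != EEXIST) {
		perror("symlink");
		return -1;
	}
	return 0;
}
```

Removing the symlink again on exit would restore the packaged rule.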
>
> This method for discovery and autodetection has some advantages over the
> current udev-rule based approach:
>
> * By using the `--persistent` option, users can easily control whether
> persistent discovery controllers for discovered transport addresses should
> be created and monitored for AEN events. **nvme monitor** watches known
> transport addresses, creates discovery controllers as required, and re-uses
> existing ones if possible.
What does that mean?
> It keeps track of persistent discovery
> controllers it created, and tears them down on exit. When run in
> `--persistent --autoconnect` mode *in the initial ramfs*, it will keep
> discovery controllers alive, so that a new instance started after switching
> root can simply re-use them.
Nice.
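To make the "re-uses existing ones if possible" part concrete, a rough sketch of how an existing persistent discovery controller for a transport address could be found by scanning sysfs — this is an assumption about the approach, not code from the patch set. The `subsysnqn`, `transport` and `address` attributes exist in current kernels; the address matching is deliberately simplified (a real implementation would compare the traddr/trsvcid fields inside the `address` string).

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define DISC_NQN "nqn.2014-08.org.nvmexpress.discovery"

/* read a controller sysfs attribute, e.g. /sys/class/nvme/nvme0/transport */
static int read_attr(const char *ctrl, const char *attr, char *buf, int len)
{
	char path[280];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/class/nvme/%s/%s", ctrl, attr);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';		/* strip trailing newline */
	return 0;
}

/* return 1 and copy the controller name if an existing discovery controller
 * for the given transport address is found, 0 otherwise */
static int find_discovery_ctrl(const char *transport, const char *traddr,
			       char *name, int len)
{
	char nqn[256], trtype[64], address[512];
	struct dirent *d;
	DIR *dir = opendir("/sys/class/nvme");

	if (!dir)
		return 0;
	while ((d = readdir(dir)) != NULL) {
		if (strncmp(d->d_name, "nvme", 4))
			continue;
		if (read_attr(d->d_name, "subsysnqn", nqn, sizeof(nqn)) ||
		    read_attr(d->d_name, "transport", trtype, sizeof(trtype)) ||
		    read_attr(d->d_name, "address", address, sizeof(address)))
			continue;
		if (!strcmp(nqn, DISC_NQN) && !strcmp(trtype, transport) &&
		    strstr(address, traddr)) {	/* simplified address match */
			snprintf(name, len, "%s", d->d_name);
			closedir(dir);
			return 1;
		}
	}
	closedir(dir);
	return 0;
}
```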
> * In certain situations, the systemd-based approach may miss events due to
> race conditions. This can happen e.g. if an FC remote port is detected, but
> shortly after its detection an FC relogin procedure becomes necessary, e.g.
> due to an RSCN. In this case, an `fc_udev_device` uevent will be received on
> the first detection and handled by an `nvme connect-all` command run from
> `nvmf-connect@.service`. The connection attempt to the rport in question will
> fail with "no such device" because of the simultaneous FC relogin.
> `nvmf-connect@.service` may not terminate immediately, because it attempts to
> establish other connections listed in the Discovery Log page it retrieved.
> When the FC relogin eventually finishes, a new uevent will be received, and
> `nvmf-connect@` will be started again, but *this has no effect* if the
> previous `nvmf-connect@` service hasn't finished yet. This is the general
> semantics of systemd services; no easy workaround exists. **nvme monitor**
> doesn't suffer from this problem: if it sees an uevent for a transport
> address for which a discovery is already running, it queues the handling of
> this event and restarts the discovery once the current one has finished.
While I understand the issue, this reason alone is overkill for doing
this.
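The queueing behaviour described above could look roughly like the following sketch. `struct monitor_conn` and `start_discovery()` are made-up names for illustration, not identifiers from the patch set.

```c
#include <stdbool.h>
#include <sys/types.h>

/* hypothetical per-transport-address state ("connection" in the patch set) */
struct monitor_conn {
	char transport[8];
	char traddr[256];
	pid_t discovery_pid;		/* 0 if no discovery child is running */
	bool discovery_pending;		/* a new event arrived while one was running */
};

/* hypothetical helper that forks a child running the discovery */
extern pid_t start_discovery(struct monitor_conn *c);

static void on_uevent(struct monitor_conn *c)
{
	if (c->discovery_pid) {
		/* a discovery for this transport address is already running:
		 * just remember that it has to be redone, don't start a second one */
		c->discovery_pending = true;
		return;
	}
	c->discovery_pid = start_discovery(c);
}

static void on_discovery_exited(struct monitor_conn *c)
{
	c->discovery_pid = 0;
	if (c->discovery_pending) {
		c->discovery_pending = false;
		c->discovery_pid = start_discovery(c);	/* restart the queued discovery */
	}
}
```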
> * Resource consumption for handling uevents is lower. Instead of running a
> udev worker, executing the rules, executing `systemctl start` from the
> worker, starting a systemd service, and starting a separate **nvme-cli**
> instance, only a single `fork()` operation is necessary. Of course, on the
> flip side, the monitor itself consumes resources while it's running and
> waiting for events. On my system with 8 persistent discovery controllers,
> its RSS is ~3MB. CPU consumption is zero as long as no events occur.
What is the baseline with what we have today?
> * **nvme monitor** could be easily extended to handle events for non-FC
> transports.
Which events?
>
> I've tested `fc_udev_device` handling for NVMeoFC with an Ontap target, and
> AEN handling for RDMA using a Linux **nvmet** target.
>
> ### Implementation notes
>
> I've tried to change the existing **nvme-cli** code as little as possible
> while reusing the code from `fabrics.c`. The majority of changes to the
> existing code export formerly static functions and variables, so that they
> are usable from the monitor code.
General comment, can you please separate out fixes/cleanups that are not
related to the goal of this patchset?
>
> The main process just waits for events using `epoll()`. When an event is
> received that necessitates a new discovery, a child is forked. This makes it
> possible to fill in the configuration parameters for `do_discover()` without
> interfering with the main process or other discovery tasks running in
> parallel. The program tracks *transport addresses* (called "connections" in
> the code) rather than NVMe controllers. In `--persistent` mode, it tries to
> maintain exactly one persistent discovery connection per transport address.
>
> Using `epoll()` may look over-engineered at this stage. I hope that its greater
> flexibility compared to `poll()` (in particular, the ability to add new event
> sources while waiting) will simplify future extensions and improvements.
>
> ### Todo
>
> * Referrals are not handled perfectly yet. They will be handled by
> `do_discover()` just as they would be when called from **nvme connect-all**,
> but it would be better to pass referrals back to the main process, making it
> aware of the additional discovery controller, rather than using recursion.
> The main process would e.g. know if a discovery is already running for the
> transport address in the referral.
Agreed; this way you can also keep new referrals persistent, which I'm
not sure we do today.
> * When "add" uevents for nvme controller devices are received, the
> controller is consistently not in `live` state yet, and attempting to read
> the `subsysnqn` sysfs attribute returns `(efault)`. While this should
> arguably be fixed in the kernel, it could be worked around in user space
> by using timers or polling the `state` sysfs attribute for changes.
This is a bug. What in the code causes this? Nothing in the controller state
should prevent this sysfs read from executing correctly...
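A hypothetical user-space workaround along the lines suggested above: poll the controller's `state` sysfs attribute until it reports "live" (or a timeout expires) before reading `subsysnqn`. The paths follow the current `/sys/class/nvme/<ctrl>/` layout; the polling interval and timeout are arbitrary.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int wait_for_live(const char *ctrl, int timeout_ms)
{
	char path[256], state[64];

	snprintf(path, sizeof(path), "/sys/class/nvme/%s/state", ctrl);

	while (timeout_ms > 0) {
		FILE *f = fopen(path, "r");

		if (f && fgets(state, sizeof(state), f) &&
		    !strncmp(state, "live", 4)) {
			fclose(f);
			return 0;	/* controller is live, safe to read subsysnqn */
		}
		if (f)
			fclose(f);
		usleep(100 * 1000);	/* re-check every 100 ms */
		timeout_ms -= 100;
	}
	return -1;			/* gave up waiting */
}
```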
> * Parse and handle `discovery.conf` on startup.
This is a must, I think. Where do you get the known transport addresses
on startup today?
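For reference, a sketch of what parsing `discovery.conf` on startup could look like, assuming the file format used by `nvme connect-all` when invoked without arguments (one set of connect options per line, `#` for comments). `add_known_address()` is a hypothetical placeholder, and the default path `/etc/nvme/discovery.conf` is an assumption.

```c
#include <stdio.h>
#include <string.h>

extern void add_known_address(int argc, char **argv);	/* hypothetical */

static void read_discovery_conf(const char *path /* e.g. "/etc/nvme/discovery.conf" */)
{
	char line[1024];
	FILE *f = fopen(path, "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f)) {
		char *argv[32], *tok, *save;
		int argc = 0;

		if (line[0] == '#' || line[0] == '\n')
			continue;
		/* split the line into connect-all style option tokens */
		for (tok = strtok_r(line, " \t\n", &save);
		     tok && argc < 32;
		     tok = strtok_r(NULL, " \t\n", &save))
			argv[argc++] = tok;
		if (argc)
			add_known_address(argc, argv);
	}
	fclose(f);
}
```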
> * Implement support for RDMA and TCP protocols.
What is needed to support them? Not sure I follow (I thought
you mentioned that you tested against a Linux nvmet-rdma target?)