[PATCH 00/35] RFC: add "nvme monitor" subcommand
Sagi Grimberg
sagi at grimberg.me
Thu Jan 28 20:14:23 EST 2021
> From: Martin Wilck <mwilck at suse.com>
>
> (Cover letter copied from https://github.com/linux-nvme/nvme-cli/pull/877)
>
> This patch set adds a new subcommand **nvme monitor**. In this mode,
> **nvme-cli** runs continuously, monitors events (currently, uevents) relevant
> for discovery, and optionally autoconnects to newly discovered subsystems.
>
> The monitor mode is well suited to being run as a systemd service; an appropriate
> unit file is provided. As such, **nvme monitor** can be used as an alternative
> to the current auto-connection mechanism based on udev rules and systemd
> template units. If `--autoconnect` is active, **nvme monitor** masks the
> respective udev rules in order to prevent simultaneous connection attempts
> from udev and itself.
I think that a monitor daemon is a good path forward.
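For readers unfamiliar with udev masking, here is a minimal, purely illustrative sketch of the usual runtime mechanism: shadow the rules file with a `/dev/null` symlink under `/run/udev/rules.d`. The rule file name (assumed here to be nvme-cli's `70-nvmf-autoconnect.rules`) and the helper are assumptions, not code from this patch set.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

static int mask_udev_rule(const char *rule /* e.g. "70-nvmf-autoconnect.rules" */)
{
	char path[256];

	snprintf(path, sizeof(path), "/run/udev/rules.d/%s", rule);
	/* a /dev/null symlink in /run/udev/rules.d overrides and effectively
	 * disables a rules file of the same name shipped by the package */
	if (symlink("/dev/null", path) < 0 && errno != EEXIST) {
		perror("symlink");
		return -1;
	}
	return 0;
}
```

Removing the symlink again on exit would restore the packaged rule.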
>
> This method for discovery and autodetection has some advantages over the
> current udev-rule based approach:
>
> * By using the `--persistent` option, users can easily control whether
> persistent discovery controllers for discovered transport addresses should
> be created and monitored for AEN events. **nvme monitor** watches known
> transport addresses, creates discovery controllers as required, and re-uses
> existing ones if possible.
What does that mean?
> It keeps track of persistent discovery
> controllers it created, and tears them down on exit. When run in
> `--persistent --autoconnect` mode *in the initial ramfs*, it will keep
> discovery controllers alive, so that a new instance started after switching
> root can simply re-use them.
Nice.
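To make the "re-uses existing ones if possible" part concrete, a rough sketch of how an existing persistent discovery controller for a transport address could be found by scanning sysfs — this is an assumption about the approach, not code from the patch set. The `subsysnqn`, `transport` and `address` attributes exist in current kernels; the address matching is deliberately simplified (a real implementation would compare the traddr/trsvcid fields inside the `address` string).

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define DISC_NQN "nqn.2014-08.org.nvmexpress.discovery"

/* read a controller sysfs attribute, e.g. /sys/class/nvme/nvme0/transport */
static int read_attr(const char *ctrl, const char *attr, char *buf, int len)
{
	char path[280];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/class/nvme/%s/%s", ctrl, attr);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';		/* strip trailing newline */
	return 0;
}

/* return 1 and copy the controller name if an existing discovery controller
 * for the given transport address is found, 0 otherwise */
static int find_discovery_ctrl(const char *transport, const char *traddr,
			       char *name, int len)
{
	char nqn[256], trtype[64], address[512];
	struct dirent *d;
	DIR *dir = opendir("/sys/class/nvme");

	if (!dir)
		return 0;
	while ((d = readdir(dir)) != NULL) {
		if (strncmp(d->d_name, "nvme", 4))
			continue;
		if (read_attr(d->d_name, "subsysnqn", nqn, sizeof(nqn)) ||
		    read_attr(d->d_name, "transport", trtype, sizeof(trtype)) ||
		    read_attr(d->d_name, "address", address, sizeof(address)))
			continue;
		if (!strcmp(nqn, DISC_NQN) && !strcmp(trtype, transport) &&
		    strstr(address, traddr)) {	/* simplified address match */
			snprintf(name, len, "%s", d->d_name);
			closedir(dir);
			return 1;
		}
	}
	closedir(dir);
	return 0;
}
```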
> * In certain situations, the systemd-based approach may miss events due to
> race conditions. This can happen e.g. if an FC remote port is detected, but
> shortly after its detection an FC relogin procedure becomes necessary, e.g.
> due to an RSCN. In this case, an `fc_udev_device` uevent will be received on
> the first detection and handled by an `nvme connect-all` command run from
> `nvmf-connect@.service`. The connection attempt to the rport in question will
> fail with "no such device" because of the simultaneous FC relogin.
> `nvmf-connect@.service` may not terminate immediately, because it attempts to
> establish other connections listed in the Discovery Log page it retrieved.
> When the FC relogin eventually finishes, a new uevent will be received, and
> `nvmf-connect@` will be started again, but *this has no effect* if the
> previous `nvmf-connect@` service hasn't finished yet. This is the general
> semantics of systemd services; no easy workaround exists. **nvme monitor**
> doesn't suffer from this problem: if it sees an uevent for a transport
> address for which a discovery is already running, it queues the handling of
> this event and restarts the discovery once the current one has finished.
While I understand the issue, this reason alone is overkill for doing
this.
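The queueing behaviour described above could look roughly like the following sketch. `struct monitor_conn` and `start_discovery()` are made-up names for illustration, not identifiers from the patch set.

```c
#include <stdbool.h>
#include <sys/types.h>

/* hypothetical per-transport-address state ("connection" in the patch set) */
struct monitor_conn {
	char transport[8];
	char traddr[256];
	pid_t discovery_pid;		/* 0 if no discovery child is running */
	bool discovery_pending;		/* a new event arrived while one was running */
};

/* hypothetical helper that forks a child running the discovery */
extern pid_t start_discovery(struct monitor_conn *c);

static void on_uevent(struct monitor_conn *c)
{
	if (c->discovery_pid) {
		/* a discovery for this transport address is already running:
		 * just remember that it has to be redone, don't start a second one */
		c->discovery_pending = true;
		return;
	}
	c->discovery_pid = start_discovery(c);
}

static void on_discovery_exited(struct monitor_conn *c)
{
	c->discovery_pid = 0;
	if (c->discovery_pending) {
		c->discovery_pending = false;
		c->discovery_pid = start_discovery(c);	/* restart the queued discovery */
	}
}
```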
> * Resource consumption for handling uevents is lower. Instead of running a
> udev worker, executing the rules, executing `systemctl start` from the
> worker, starting a systemd service, and starting a separate **nvme-cli**
> instance, only a single `fork()` operation is necessary. Of course, on the
> flip side, the monitor itself consumes resources while it's running and
> waiting for events. On my system with 8 persistent discovery controllers,
> its RSS is ~3MB. CPU consumption is zero as long as no events occur.
What is the baseline with what we have today?
> * **nvme monitor** could be easily extended to handle events for non-FC
> transports.
Which events?
>
> I've tested `fc_udev_device` handling for NVMeoFC with an Ontap target, and
> AEN handling for RDMA using a Linux **nvmet** target.
>
> ### Implementation notes
>
> I've tried to change the existing **nvme-cli** code as little as possible
> while reusing the code from `fabrics.c`. The majority of changes to the
> existing code export formerly static functions and variables, so that they
> are usable from the monitor code.
General comment, can you please separate out fixes/cleanups that are not
related to the goal of this patchset?
>
> The main process just waits for events using `epoll()`. When an event is
> received that necessitates a new discovery, a child is forked. This makes it
> possible to fill in the configuration parameters for `do_discover()` without
> interfering with the main process or other discovery tasks running in
> parallel. The program tracks *transport addresses* (called "connections" in
> the code) rather than NVMe controllers. In `--persistent` mode, it tries to
> maintain exactly one persistent discovery connection per transport address.
>
> Using `epoll()` may look over-engineered at this stage. I hope that its greater
> flexibility compared to `poll()` (in particular, the ability to add new event
> sources while waiting) will simplify future extensions and improvements.
>
> ### Todo
>
> * Referrals are not handled perfectly yet. They will be handled by
> `do_discover()` just as they would be when called from **nvme connect-all**,
> but it would be better to pass referrals back to the main process, making it
> aware of the additional discovery controller, rather than using recursion.
> The main process would e.g. know if a discovery is already running for the
> transport address in the referral.
Agreed; this way you can also keep new referrals persistent, which I'm
not sure we do today.
> * When "add" uevents for nvme controller devices are received, the
> controller is consistently not in `live` state yet, and attempting to read
> the `subsysnqn` sysfs attribute returns `(efault)`. While this should
> arguably be fixed in the kernel, it could be worked around in user space
> by using timers or polling the `state` sysfs attribute for changes.
This is a bug. What in the code causes this? Nothing in the controller state
should prevent this sysfs read from executing correctly...
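A hypothetical user-space workaround along the lines suggested above: poll the controller's `state` sysfs attribute until it reports "live" (or a timeout expires) before reading `subsysnqn`. The paths follow the current `/sys/class/nvme/<ctrl>/` layout; the polling interval and timeout are arbitrary.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int wait_for_live(const char *ctrl, int timeout_ms)
{
	char path[256], state[64];

	snprintf(path, sizeof(path), "/sys/class/nvme/%s/state", ctrl);

	while (timeout_ms > 0) {
		FILE *f = fopen(path, "r");

		if (f && fgets(state, sizeof(state), f) &&
		    !strncmp(state, "live", 4)) {
			fclose(f);
			return 0;	/* controller is live, safe to read subsysnqn */
		}
		if (f)
			fclose(f);
		usleep(100 * 1000);	/* re-check every 100 ms */
		timeout_ms -= 100;
	}
	return -1;			/* gave up waiting */
}
```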
> * Parse and handle `discovery.conf` on startup.
This is a must, I think. Where do you get the known transport addresses
on startup today?
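For reference, a sketch of what parsing `discovery.conf` on startup could look like, assuming the file format used by `nvme connect-all` when invoked without arguments (one set of connect options per line, `#` for comments). `add_known_address()` is a hypothetical placeholder, and the default path `/etc/nvme/discovery.conf` is an assumption.

```c
#include <stdio.h>
#include <string.h>

extern void add_known_address(int argc, char **argv);	/* hypothetical */

static void read_discovery_conf(const char *path /* e.g. "/etc/nvme/discovery.conf" */)
{
	char line[1024];
	FILE *f = fopen(path, "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f)) {
		char *argv[32], *tok, *save;
		int argc = 0;

		if (line[0] == '#' || line[0] == '\n')
			continue;
		/* split the line into connect-all style option tokens */
		for (tok = strtok_r(line, " \t\n", &save);
		     tok && argc < 32;
		     tok = strtok_r(NULL, " \t\n", &save))
			argv[argc++] = tok;
		if (argc)
			add_known_address(argc, argv);
	}
	fclose(f);
}
```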
> * Implement support for RDMA and TCP protocols.
What is needed to support them? Not sure I follow (I thought
you mentioned that you tested against a Linux nvmet-rdma target?)