[PATCH 00/35] RFC: add "nvme monitor" subcommand

Fri Jan 29 15:27:43 EST 2021

On Fri, 2021-01-29 at 12:08 -0800, Sagi Grimberg wrote:
> 
> > > > 
> > > > This method for discovery and autodetection has some advantages
> > > > over the
> > > > current udev-rule based approach:
> > > > 
> > > >    * By using the `--persistent` option, users can easily
> > > > control
> > > > whether
> > > >      persistent discovery controllers for discovered transport
> > > > addresses should
> > > >      be created and monitored for AEN events. **nvme monitor**
> > > > watches known
> > > >      transport addresses, creates discovery controllers as
> > > > required,
> > > > and re-uses
> > > >      existing ones if possible.
> > > 
> > > What does that mean?
> > 
> > In general, if the monitor detects a new host_traddr/traddr/trsvcid
> > tuple, it runs a discovery on it, and keeps the discovery
> > controller
> > open if --persistent was given. On startup, it scans existing
> > controllers, and if it finds already existing discovery
> > controllers,
> > re-uses them. These will not be shut down when the monitor exits.
> 
> And if it doesn't run with --persistent it deletes them? even if
> these
> weren't created by it?

No, not those. That's what I wanted to express with "These will not be
shut down when the monitor exits."

It's not trivial to determine reliably which controller the monitor
created and which it didn't. The current code assumes that all
discovery controllers that didn't exist at startup were created by the
monitor. We can improve the intelligence of the tool in that area, of
course. Currently it makes this dumb assumption.

> 
> And, if none exist where does it get new discovery controller
> details?
> Today we use discovery.conf for that.

As I said, this can be done with a separate "nvme connect-all" call,
which could be run from an ExecStartPre or from a separate service.
Or we can integrate the functionality in the monitor. I don't have a
strong opinion either way. I can adapt to what you guys prefer.

> 
> > This allows users fine-grained control about what discovery
> > controllers
> > to operate persistently. Users who want all discovery controllers
> > to be
> > persistent just use --persistent. Others can set up those that they
> > want to have (manually or with a script), and not use --persistent.
> > 
> > The background is that hosts may not need every detected discovery
> > controller to be persistent. In multipath scenarios, you may see
> > more
> > discovery subsystems than anything else, and not everyone likes
> > that.
> > That's a generic issue and unrelated to the monitor, but running
> > the
> > monitor with --persistent creates discovery controllers that would
> > otherwise not be visible.
> > 
> > Hope this clarifies it.
> 
> Well, How do people expect to know if stuff change without a
> persistent
> discovery controller? Not sure if people may see more discovery
> controllers this is a real issue given what they are getting from it.

You're right. At this very early stage, I wanted to give users the
freedom of choice, and not force dozens of discovery controller
connections upon everyone. If there's consensus that the discovery
connections should always be created, fine with me. 

> > I agree. But it's not easy to fix the issue otherwise. In the
> > customer
> > problem where we observed it, I worked around it by adding the udev
> > seqnum to the "instance name" of the systemd service, thus allowing
> > several "nvme connect-all" processes to run for the same transport
> > address simultaneously. But I don't think that would scale well;
> > the monitor can handle it more cleanly.
> 
> Still changing the entire thing for a corner case...

It's just one puzzle piece.

> > > >    * Resource consumption for handling uevents is lower.
> > > > Instead of
> > > > running an
> > > >      udev worker, executing the rules, executing `systemctl
> > > > start`
> > > > from the
> > > >      worker, starting a systemd service, and starting a
> > > > separate
> > > > **nvme-cli**
> > > >      instance, only a single `fork()` operation is necessary.
> > > > Of
> > > > course, on the
> > > >      back side, the monitor itself consumes resources while
> > > > it's
> > > > running and
> > > >      waiting for events. On my system with 8 persistent
> > > > discovery
> > > > controllers,
> > > >      its RSS is ~3MB. CPU consumption is zero as long as no
> > > > events
> > > > occur.
> > > 
> > > What is the baseline with what we have today?
> > 
> > A meaningful comparsion is difficult and should be done when the
> > monitor functionality is finalized. I made this statement only to
> > provide a rough idea of the resource usage, not more.
> 
> You mention that the utilization is lower than what we do today,
> hence
> my question is by how much?

I'll try to figure it out and come up with something. Sorry for the
hand-waving assertion. I admint It was speculative for the most part.

> > > >    * **nvme monitor** could be easily extended to handle events
> > > > for
> > > > non-FC
> > > >      transports.
> > > 
> > > Which events?
> > 
> > Network discovery, mDNS or the like. I haven't digged into the
> > details
> > yet.
> 
> Yes, that is possible, would probably be easier to do this in a
> higher
> level language but libavahi can also do this...

Enzo Matsumiya has been working on it, and we've been discussing how to
integrate his functionality into the monitor.

> > > 
> > > >    * Parse and handle `discovery.conf` on startup.
> > > 
> > > This is a must I think, where do you get the known transport
> > > addresses
> > > on startup today?
> > 
> > There's a systemd service that runs "nvme connect-all" once during
> > boot. That exists today. I'm not sure if it should be integrated in
> > the
> > monitor, perhaps it's good to keep these separate. People who don't
> > need the monitor can still run the existing service only, whereas
> > for
> > others, the two would play together just fine.
> 
> Then doesn't this service need to run after it?

Not necessarily. It can handle controllers that are created by other
processes while it's running.

Thanks for your comments,
Martin