[PATCH 00/35] RFC: add "nvme monitor" subcommand

Sagi Grimberg sagi at grimberg.me
Fri Jan 29 15:08:08 EST 2021


>>>
>>> This method for discovery and autodetection has some advantages over
>>> the current udev-rule based approach:
>>>
>>>    * By using the `--persistent` option, users can easily control
>>>      whether persistent discovery controllers for discovered transport
>>>      addresses should be created and monitored for AEN events.
>>>      **nvme monitor** watches known transport addresses, creates
>>>      discovery controllers as required, and re-uses existing ones if
>>>      possible.
>>
>> What does that mean?
> 
> In general, if the monitor detects a new host_traddr/traddr/trsvcid
> tuple, it runs a discovery on it, and keeps the discovery controller
> open if --persistent was given. On startup, it scans existing
> controllers, and if it finds already existing discovery controllers,
> re-uses them. These will not be shut down when the monitor exits.

And if it doesn't run with --persistent, does it delete them? Even if
they weren't created by it?

And if none exist, where does it get new discovery controller details?
Today we use discovery.conf for that.

> This allows users fine-grained control over which discovery controllers
> to operate persistently. Users who want all discovery controllers to be
> persistent just use --persistent. Others can set up the ones they want
> (manually or with a script), and not use --persistent.
> 
> The background is that hosts may not need every detected discovery
> controller to be persistent. In multipath scenarios, you may see more
> discovery subsystems than anything else, and not everyone likes that.
> That's a generic issue and unrelated to the monitor, but running the
> monitor with --persistent creates discovery controllers that would
> otherwise not be visible.
> 
> Hope this clarifies it.

Well, how do people expect to know if anything changes without a
persistent discovery controller? I'm not sure that seeing more discovery
controllers is a real issue, given what they are getting from them.

But OK.
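
For my own understanding, I picture the "re-use existing discovery
controllers" part roughly like the sketch below. This is not your code,
just an illustration: scan the controllers in sysfs and match the
well-known discovery NQN.

```c
/* Illustrative only: find existing discovery controllers in sysfs so
 * the monitor can re-use them instead of creating new ones. */
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

#define NVME_DISC_NQN "nqn.2014-08.org.nvmexpress.discovery"

static int is_discovery_ctrl(const char *name)
{
	char path[PATH_MAX], nqn[256] = "";
	FILE *f;

	snprintf(path, sizeof(path), "/sys/class/nvme/%s/subsysnqn", name);
	f = fopen(path, "r");
	if (!f)
		return 0;
	if (fgets(nqn, sizeof(nqn), f))
		nqn[strcspn(nqn, "\n")] = '\0';
	fclose(f);
	return !strcmp(nqn, NVME_DISC_NQN);
}

int main(void)
{
	DIR *dir = opendir("/sys/class/nvme");
	struct dirent *d;

	if (!dir)
		return 1;
	while ((d = readdir(dir)))
		if (!strncmp(d->d_name, "nvme", 4) &&
		    is_discovery_ctrl(d->d_name))
			printf("re-usable discovery controller: %s\n",
			       d->d_name);
	closedir(dir);
	return 0;
}
```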

>>>    * In certain situations, the systemd-based approach may miss events
>>>      due to race conditions. This can happen e.g. if an FC remote port
>>>      is detected, but shortly after its detection an FC relogin
>>>      procedure becomes necessary, e.g. due to an RSCN. In this case,
>>>      an `fc_udev_device` uevent will be received on the first
>>>      detection and handled by an `nvme connect-all` command run from
>>>      `nvmf-connect@.service`. The connection attempt to the rport in
>>>      question will fail with "no such device" because of the
>>>      simultaneous FC relogin. `nvmf-connect@.service` may not
>>>      terminate immediately, because it attempts to establish other
>>>      connections listed in the Discovery Log page it retrieved. When
>>>      the FC relogin eventually finishes, a new uevent will be
>>>      received, and `nvmf-connect@` will be started again, but *this
>>>      has no effect* if the previous `nvmf-connect@` service hasn't
>>>      finished yet. This is the general semantics of systemd services;
>>>      no easy workaround exists. **nvme monitor** doesn't suffer from
>>>      this problem. If it sees a uevent for a transport address for
>>>      which a discovery is already running, it queues the handling of
>>>      this event and restarts the discovery after the current run has
>>>      finished.
>>
>> While I understand the issue, this reason alone is overkill for doing
>> this.
> 
> I agree. But it's not easy to fix the issue otherwise. In the customer
> problem where we observed it, I worked around it by adding the udev
> seqnum to the "instance name" of the systemd service, thus allowing
> several "nvme connect-all" processes to run for the same transport
> address simultaneously. But I don't think that would scale well;
> the monitor can handle it more cleanly.

Still changing the entire thing for a corner case...
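
For the record, if I understand the queueing you describe, it amounts to
keeping a pending flag per transport address, roughly like this sketch
(struct name and helpers are made up, not taken from the patches):

```c
/* Sketch of per-address event coalescing: if a discovery is already
 * running when a uevent arrives, remember it and re-run the discovery
 * once the current one finishes. Names are illustrative only. */
#include <stdbool.h>
#include <stdio.h>

struct monitor_addr {
	bool discovery_running;
	bool event_pending;
};

static void start_discovery(struct monitor_addr *a)
{
	/* In the real monitor this would fork() and run the discovery. */
	a->discovery_running = true;
	printf("discovery started\n");
}

static void on_uevent(struct monitor_addr *a)
{
	if (a->discovery_running)
		a->event_pending = true;	/* queue it up */
	else
		start_discovery(a);
}

static void on_discovery_finished(struct monitor_addr *a)
{
	a->discovery_running = false;
	if (a->event_pending) {		/* restart for the queued event */
		a->event_pending = false;
		start_discovery(a);
	}
}

int main(void)
{
	struct monitor_addr a = { 0 };

	on_uevent(&a);			/* starts a discovery */
	on_uevent(&a);			/* coalesced, no second start */
	on_discovery_finished(&a);	/* restarts once for the queued event */
	return 0;
}
```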

> 
>>>    * Resource consumption for handling uevents is lower. Instead of
>>>      running a udev worker, executing the rules, executing
>>>      `systemctl start` from the worker, starting a systemd service,
>>>      and starting a separate **nvme-cli** instance, only a single
>>>      `fork()` operation is necessary. Of course, on the other hand,
>>>      the monitor itself consumes resources while it's running and
>>>      waiting for events. On my system with 8 persistent discovery
>>>      controllers, its RSS is ~3MB. CPU consumption is zero as long as
>>>      no events occur.
>>
>> What is the baseline with what we have today?
> 
> A meaningful comparison is difficult and should be done when the
> monitor functionality is finalized. I made this statement only to
> provide a rough idea of the resource usage, not more.

You mention that the utilization is lower than what we have today, hence
my question: by how much?
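
Just so we're comparing the same thing: on the monitor side the per-event
handling would be essentially a single fork(), along these lines
(run_discovery() is a placeholder, not code from the series):

```c
/* Sketch of the "single fork() per event" model; run_discovery() stands
 * in for the in-process discovery that replaces udev + systemd + a
 * separate nvme-cli invocation. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int run_discovery(const char *traddr)
{
	printf("would run discovery for %s\n", traddr);
	return 0;
}

static pid_t handle_event(const char *traddr)
{
	pid_t pid = fork();

	if (pid == 0)			/* child: do the discovery */
		_exit(run_discovery(traddr));
	return pid;			/* parent: keep waiting for events */
}

int main(void)
{
	pid_t pid = handle_event("192.168.0.1");

	if (pid > 0)			/* the monitor would reap asynchronously */
		waitpid(pid, NULL, 0);
	return 0;
}
```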

>>>    * **nvme monitor** could be easily extended to handle events for
>>>      non-FC transports.
>>
>> Which events?
> 
> Network discovery, mDNS or the like. I haven't dug into the details
> yet.

Yes, that is possible. It would probably be easier to do in a
higher-level language, but libavahi can also do this...
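
Roughly along these lines, with libavahi (untested sketch; the
`_nvme-disc._tcp` service type and the hand-off to the discovery code are
assumptions on my part):

```c
/* Sketch: browse mDNS for NVMe-oF discovery services and hand new
 * entries to the monitor. Only the browse side is shown; resolving a
 * service to traddr/trsvcid is left out. */
#include <stdio.h>
#include <avahi-client/client.h>
#include <avahi-client/lookup.h>
#include <avahi-common/error.h>
#include <avahi-common/simple-watch.h>

static void browse_cb(AvahiServiceBrowser *b, AvahiIfIndex iface,
		      AvahiProtocol proto, AvahiBrowserEvent event,
		      const char *name, const char *type, const char *domain,
		      AvahiLookupResultFlags flags, void *userdata)
{
	if (event == AVAHI_BROWSER_NEW)
		/* A resolver would be started here; the resulting address
		 * would feed the same path as an fc_udev_device event. */
		printf("new discovery service: %s (%s) in %s\n",
		       name, type, domain);
}

int main(void)
{
	AvahiSimplePoll *poll = avahi_simple_poll_new();
	AvahiClient *client;
	int err;

	client = avahi_client_new(avahi_simple_poll_get(poll), 0, NULL, NULL,
				  &err);
	if (!client) {
		fprintf(stderr, "avahi: %s\n", avahi_strerror(err));
		return 1;
	}
	avahi_service_browser_new(client, AVAHI_IF_UNSPEC, AVAHI_PROTO_UNSPEC,
				  "_nvme-disc._tcp", NULL, 0, browse_cb, NULL);
	avahi_simple_poll_loop(poll);
	return 0;
}
```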

>>> I've tested `fc_udev_device` handling for NVMeoFC with an Ontap
>>> target, and AEN handling for RDMA using a Linux **nvmet** target.
>>>
>>> ### Implementation notes
>>>
>>> I've tried to change the existing **nvme-cli** code as little as
>>> possible while reusing the code from `fabrics.c`. The majority of the
>>> changes to the existing code export formerly static functions and
>>> variables, so that they are usable from the monitor code.
>>
>> General comment: can you please separate out fixes/cleanups that are
>> not related to the goal of this patchset?
> 
> Which ones are you referring to? 09 and 19? While these are minor
> improvements to the existing code, I wouldn't say they qualify as fixes
> or cleanups. They aren't necessary without adding the monitor code.

Patches that are not directly related to the goal of this series should
be split out to ease the review.

> But yes, I can post all changes to existing code separately.
> 
>>>    *  When "add" uevents for nvme controller devices are received,
>>>       the controller is consistently not in `live` state yet, and
>>>       attempting to read the `subsysnqn` sysfs attribute returns
>>>       `(efault)`. While this should arguably be fixed in the kernel,
>>>       it could be worked around in user space by using timers or
>>>       polling the `state` sysfs attribute for changes.
>>
>> This is a bug. What in the code causes this? Nothing in the controller
>> state should prevent this sysfs read from executing correctly...
> 
> I think it can be fixed by making nvme_sysfs_show_subsysnqn() fall back
> to ctrl->opts->subsysnqn if ctrl->subsys is NULL.
> 
> I'll send a patch. Anyway, it'll take time until this is fixed
> everywhere.

Thanks.
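
Until a kernel fix is picked up everywhere, the user-space workaround you
mention (polling the `state` attribute before reading `subsysnqn`) could
look roughly like this; the paths and timeout are only illustrative:

```c
/* Sketch: after an "add" uevent, wait for the controller to become
 * "live" before reading subsysnqn, instead of reading it right away. */
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int read_attr(const char *ctrl, const char *attr, char *buf, int len)
{
	char path[PATH_MAX];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/class/nvme/%s/%s", ctrl, attr);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

int wait_live_and_read_nqn(const char *ctrl, char *nqn, int len)
{
	char state[32];
	int i;

	for (i = 0; i < 50; i++) {	/* give up after ~5 seconds */
		if (!read_attr(ctrl, "state", state, sizeof(state)) &&
		    !strcmp(state, "live"))
			return read_attr(ctrl, "subsysnqn", nqn, len);
		usleep(100 * 1000);
	}
	return -1;
}
```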

> 
>>
>>>    * Parse and handle `discovery.conf` on startup.
>>
>> This is a must, I think. Where do you get the known transport
>> addresses on startup today?
> 
> There's a systemd service that runs "nvme connect-all" once during
> boot. That exists today. I'm not sure whether it should be integrated
> into the monitor; perhaps it's better to keep them separate. People who
> don't need the monitor can still run just the existing service, whereas
> for others, the two would play together just fine.

Then doesn't this service need to run after it?

>>>    * Implement support for RDMA and TCP protocols.
>>
>> What is needed to support them? I'm not sure I follow (I thought you
>> mentioned that you tested against a Linux nvmet-rdma target?)
>>
> 
> AENs over existing discovery controllers are supported for all
> transports. But there's no support for detecting new transport
> addresses except for NVMeoFC's "fc_udev_device" mechanism.

I see. thanks.


