[PATCH 00/35] RFC: add "nvme monitor" subcommand
Sagi Grimberg
sagi at grimberg.me
Fri Jan 29 15:08:08 EST 2021
>>>
>>> This method for discovery and autodetection has some advantages over
>>> the current udev-rule based approach:
>>>
>>> * By using the `--persistent` option, users can easily control whether
>>>   persistent discovery controllers for discovered transport addresses
>>>   should be created and monitored for AEN events. **nvme monitor**
>>>   watches known transport addresses, creates discovery controllers as
>>>   required, and re-uses existing ones if possible.
>>
>> What does that mean?
>
> In general, if the monitor detects a new host_traddr/traddr/trsvcid
> tuple, it runs a discovery on it, and keeps the discovery controller
> open if --persistent was given. On startup, it scans existing
> controllers, and if it finds already existing discovery controllers,
> re-uses them. These will not be shut down when the monitor exits.
And if it doesn't run with --persistent, does it delete them? Even if
they weren't created by it?
And if none exist, where does it get the new discovery controller
details? Today we use discovery.conf for that.
> This allows users fine-grained control about what discovery controllers
> to operate persistently. Users who want all discovery controllers to be
> persistent just use --persistent. Others can set up those that they
> want to have (manually or with a script), and not use --persistent.
>
> The background is that hosts may not need every detected discovery
> controller to be persistent. In multipath scenarios, you may see more
> discovery subsystems than anything else, and not everyone likes that.
> That's a generic issue and unrelated to the monitor, but running the
> monitor with --persistent creates discovery controllers that would
> otherwise not be visible.
>
> Hope this clarifies it.
Well, how do people expect to know when things change without a
persistent discovery controller? I'm not sure that people seeing more
discovery controllers is a real issue, given what they are getting from
them. But OK.
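
As background for the re-use discussion above: an existing discovery
controller can be recognized purely from sysfs, by the well-known
discovery NQN plus the transport address attributes. The snippet below
is a minimal sketch of that matching, not the monitor's actual code; it
assumes the standard /sys/class/nvme layout, and the helper name
find_existing_disc_ctrl() is made up for illustration.

/*
 * Minimal sketch, not the monitor's actual code: find an existing
 * discovery controller for a given transport address by scanning the
 * standard sysfs layout (/sys/class/nvme/nvme*/subsysnqn and .../address).
 * Error handling and tuple matching are reduced to the bare minimum.
 */
#include <stdio.h>
#include <string.h>
#include <dirent.h>

#define NVME_CLASS_DIR "/sys/class/nvme"
#define NVME_DISC_NQN  "nqn.2014-08.org.nvmexpress.discovery"

static int read_attr(const char *ctrl, const char *attr, char *buf, size_t len)
{
	char path[512];
	FILE *f;

	snprintf(path, sizeof(path), NVME_CLASS_DIR "/%s/%s", ctrl, attr);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';	/* strip trailing newline */
	return 0;
}

/* Returns 1 if "name" is a discovery controller whose address string
 * contains the given traddr, 0 otherwise. */
static int is_matching_disc_ctrl(const char *name, const char *traddr)
{
	char nqn[256], addr[512];

	if (read_attr(name, "subsysnqn", nqn, sizeof(nqn)) ||
	    read_attr(name, "address", addr, sizeof(addr)))
		return 0;
	return !strcmp(nqn, NVME_DISC_NQN) && strstr(addr, traddr) != NULL;
}

/* Copy the name (e.g. "nvme3") of a matching discovery controller into
 * "out"; return 0 on success, -1 if none was found. */
int find_existing_disc_ctrl(const char *traddr, char *out, size_t len)
{
	struct dirent *e;
	DIR *d = opendir(NVME_CLASS_DIR);

	if (!d)
		return -1;
	while ((e = readdir(d))) {
		if (strncmp(e->d_name, "nvme", 4))
			continue;
		if (is_matching_disc_ctrl(e->d_name, traddr)) {
			snprintf(out, len, "%s", e->d_name);
			closedir(d);
			return 0;
		}
	}
	closedir(d);
	return -1;
}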
>>> * In certain situations, the systemd-based approach may miss events
>>>   due to race conditions. This can happen e.g. if an FC remote port
>>>   is detected, but shortly after its detection an FC relogin
>>>   procedure becomes necessary, e.g. due to an RSCN. In this case, an
>>>   `fc_udev_device` uevent will be received on the first detection and
>>>   handled by an `nvme connect-all` command run from
>>>   `nvmf-connect@.service`. The connection attempt to the rport in
>>>   question will fail with "no such device" because of the
>>>   simultaneous FC relogin. `nvmf-connect@.service` may not terminate
>>>   immediately, because it attempts to establish other connections
>>>   listed in the Discovery Log page it retrieved. When the FC relogin
>>>   eventually finishes, a new uevent will be received, and
>>>   `nvmf-connect@` will be started again, but *this has no effect* if
>>>   the previous `nvmf-connect@` service hasn't finished yet. This is
>>>   the general semantics of systemd services; no easy workaround
>>>   exists. **nvme monitor** doesn't suffer from this problem. If it
>>>   sees an uevent for a transport address for which a discovery is
>>>   already running, it will queue the handling of this event and
>>>   restart the discovery after the current one has finished.
>>
>> While I understand the issue, this reason alone is overkill for
>> doing this.
>
> I agree. But it's not easy to fix the issue otherwise. In the customer
> problem where we observed it, I worked around it by adding the udev
> seqnum to the "instance name" of the systemd service, thus allowing
> several "nvme connect-all" processes to run for the same transport
> address simultaneously. But I don't think that would scale well;
> the monitor can handle it more cleanly.
Still, changing the entire thing for a corner case...
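
To illustrate the queue-and-restart behaviour the cover letter describes
(an event that arrives while a discovery for the same transport address
is in flight is merely remembered, and the discovery is re-run once the
current one ends), here is a minimal sketch. It is not the monitor's
actual implementation; start_discovery() is a hypothetical hook for
whatever forks and runs the discovery.

/*
 * Sketch of the "queue and restart" behaviour described in the cover
 * letter, not the actual monitor code. Each tracked transport address
 * carries two flags: if an event arrives while a discovery is in
 * flight, it is only remembered, and a fresh discovery is started once
 * the current one completes.
 */
#include <stdbool.h>

struct disc_task {
	bool running;	/* a discovery is currently in flight */
	bool pending;	/* another event arrived meanwhile     */
};

/* hypothetical hook that actually forks and runs the discovery */
extern void start_discovery(struct disc_task *t);

void on_uevent(struct disc_task *t)
{
	if (t->running) {
		t->pending = true;	/* coalesce instead of spawning */
		return;
	}
	t->running = true;
	start_discovery(t);
}

void on_discovery_done(struct disc_task *t)
{
	t->running = false;
	if (t->pending) {
		t->pending = false;	/* consume the queued event */
		t->running = true;
		start_discovery(t);	/* restart with fresh state  */
	}
}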
>
>>> * Resource consumption for handling uevents is lower. Instead of
>>>   running a udev worker, executing the rules, executing `systemctl
>>>   start` from the worker, starting a systemd service, and starting a
>>>   separate **nvme-cli** instance, only a single `fork()` operation is
>>>   necessary. Of course, on the flip side, the monitor itself consumes
>>>   resources while it's running and waiting for events. On my system
>>>   with 8 persistent discovery controllers, its RSS is ~3 MB. CPU
>>>   consumption is zero as long as no events occur.
>>
>> What is the baseline with what we have today?
>
> A meaningful comparison is difficult and should be done when the
> monitor functionality is finalized. I made this statement only to
> provide a rough idea of the resource usage, not more.
You mention that the utilization is lower than what we have today,
hence my question: by how much?
>>> * **nvme monitor** could be easily extended to handle events for
>>>   non-FC transports.
>>
>> Which events?
>
> Network discovery, mDNS or the like. I haven't dug into the details
> yet.
Yes, that is possible. It would probably be easier to do this in a
higher-level language, but libavahi can also do it...
>>> I've tested `fc_udev_device` handling for NVMeoFC with an Ontap
>>> target, and AEN handling for RDMA using a Linux **nvmet** target.
>>>
>>> ### Implementation notes
>>>
>>> I've tried to change the existing **nvme-cli** code as little as
>>> possible while reusing the code from `fabrics.c`. The majority of
>>> changes to the existing code export formerly static functions and
>>> variables so that they are usable from the monitor code.
>>
>> General comment: can you please separate out fixes/cleanups that are
>> not related to the goal of this patchset?
>
> Which ones are you referring to? 09 and 19? While these are minor
> improvements to the existing code, I wouldn't say they qualify as fixes
> or cleanups. They aren't necessary without adding the monitor code.
Patches that are not directly related to the goal of this series should
be split out to ease the review.
> But yes, I can post all changes to existing code separately.
>
>>> * When "add" uevents for nvme controller devices are received, the
>>>   controller is consistently not in `live` state yet, and attempting
>>>   to read the `subsysnqn` sysfs attribute returns `(efault)`. While
>>>   this should arguably be fixed in the kernel, it could be worked
>>>   around in user space by using timers or polling the `state` sysfs
>>>   attribute for changes.
>>
>> This is a bug; what in the code causes this? Nothing in the
>> controller state should prevent this sysfs read from executing
>> correctly...
>
> I think it can be fixed by making nvme_sysfs_show_subsysnqn() fall back
> to ctrl->opts->subsysnqn if ctrl->subsys is NULL.
>
> I'll send a patch. Anyway, it'll take time until this is fixed
> everywhere.
Thanks.
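
For completeness, the user-space workaround mentioned in the cover
letter (polling the `state` attribute before touching `subsysnqn`)
could look roughly like the sketch below. This is only an illustration,
assuming the standard /sys/class/nvme layout; the function names and
the 5-second timeout are made up here, and it is not code from the
series.

/*
 * Workaround sketch, not code from the series: wait until the newly
 * added controller reports "live" in its sysfs "state" attribute, then
 * read "subsysnqn". Timeout and poll interval are arbitrary.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int read_sysfs_attr(const char *ctrl, const char *attr,
			   char *buf, size_t len)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/class/nvme/%s/%s", ctrl, attr);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';	/* strip trailing newline */
	return 0;
}

/* Poll for up to ~5 seconds for the controller to become live,
 * then fetch its subsystem NQN. Returns 0 on success. */
int get_subsysnqn_when_live(const char *ctrl, char *nqn, size_t len)
{
	char state[32];
	int i;

	for (i = 0; i < 50; i++) {
		if (!read_sysfs_attr(ctrl, "state", state, sizeof(state)) &&
		    !strcmp(state, "live"))
			return read_sysfs_attr(ctrl, "subsysnqn", nqn, len);
		usleep(100000);		/* 100 ms between polls */
	}
	return -1;
}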
>
>>
>>> * Parse and handle `discovery.conf` on startup.
>>
>> This is a must, I think. Where do you get the known transport
>> addresses on startup today?
>
> There's a systemd service that runs "nvme connect-all" once during
> boot. That exists today. I'm not sure whether it should be integrated
> into the monitor; perhaps it's good to keep these separate. People who
> don't need the monitor can still run the existing service only,
> whereas for others, the two would play together just fine.
Then doesn't this service need to run after it?
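
As a rough illustration of what handling discovery.conf at startup
could look like: read /etc/nvme/discovery.conf line by line and run
`nvme connect-all` with each line's arguments. This is not part of the
series and only one plausible interpretation of the TODO item;
connect_all_from_conf() is a hypothetical name, and comment/quoting
handling is deliberately naive.

/*
 * Rough illustration, not part of the series: run "nvme connect-all"
 * once for every argument line found in /etc/nvme/discovery.conf,
 * skipping blank lines and '#' comments. Quoting is handled naively.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DISCOVERY_CONF "/etc/nvme/discovery.conf"

int connect_all_from_conf(void)
{
	char line[1024], cmd[1100];
	FILE *f = fopen(DISCOVERY_CONF, "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		line[strcspn(line, "\n")] = '\0';
		if (line[0] == '\0' || line[0] == '#')
			continue;	/* skip blanks and comments */
		snprintf(cmd, sizeof(cmd), "nvme connect-all %s", line);
		if (system(cmd) != 0)
			fprintf(stderr, "discovery failed: %s\n", cmd);
	}
	fclose(f);
	return 0;
}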
>>> * Implement support for RDMA and TCP protocols.
>>
>> What is needed to support them? Not sure I follow (I thought you
>> mentioned that you tested against Linux nvmet-rdma?)
>>
>
> AENs over existing discovery controllers are supported for all
> transports. But there's no support for discovery of new transports
> except for NVMeoFC's "fc_udev_device" mechanism.
I see, thanks.