[PATCH rfc 0/1] Fix missing AENs when discovery controllers are disconnected

Sagi Grimberg sagi at grimberg.me
Wed Sep 14 08:31:08 PDT 2022



On 9/14/22 18:00, James Smart wrote:
> On 9/14/2022 3:29 AM, Sagi Grimberg wrote:
>>
>>> I was thinking this rather than the new option... The bit would only 
>>> be set *after* the first successful link-side connect, thus the 
>>> initial nvme_start_ctrl would not send the reconnect event. Every 
>>> reconnect thereafter would.
>>>
>>> -- james
>>>
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index 2429b11eb9a8..43c8b6590164 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -4814,6 +4814,10 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl)
>>>
>>>       nvme_enable_aen(ctrl);
>>>
>>> +    if (nvme_discovery_ctrl(ctrl) &&
>>> +        test_bit(NVME_CTRL_FABRIC_CONNECTED, &ctrl->flags))
>>> +        nvme_change_uevent(ctrl, "NVME_EVENT=rediscover");
>>> +
>>>       if (ctrl->queue_count > 1) {
>>>           nvme_queue_scan(ctrl);
>>>           nvme_start_queues(ctrl);
>>> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
>>> index 127abaf9ba5d..ff1dd8f999b0 100644
>>> --- a/drivers/nvme/host/fc.c
>>> +++ b/drivers/nvme/host/fc.c
>>> @@ -2947,6 +2947,7 @@ nvme_fc_create_io_queues(struct nvme_fc_ctrl 
>>> *ctrl)
>>>           goto out_delete_hw_queues;
>>>
>>>       ctrl->ioq_live = true;
>>> +    set_bit(NVME_CTRL_FABRIC_CONNECTED, &ctrl->ctrl.flags);
>>
>> Isn't this set before calling nvme_start_ctrl()?
> 
> frick - sent the wrong tweak for fc:
> 
> this should be:
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index 127abaf9ba5d..630d8e50f29c 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -3201,6 +3201,8 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)
>          if (changed)
>                  nvme_start_ctrl(&ctrl->ctrl);
> 
> +       set_bit(NVME_CTRL_FABRIC_CONNECTED, &ctrl->ctrl.flags);
> +
>          return 0;       /* Success */
> 
>   out_term_aen_ops:
> 
> 
> So in all cases, it's now set immediately after calling nvme_start_ctrl. 
> Won't be set on 1st call but will be all on all reamaing.

I would rename it as NVME_CTRL_STARTED_ONCE or something. I feel that
every name we choose would be kinda ugly, which is why I opted to
relying on nr_reconnects that the transports maintain.

But again, all of this is orthogonal to the connect parameter question.

>> But this is orthogonal to the user argument, is there a specific
>> reason to why not add an explicit argument requesting that?
>> Because every userspace that will setup a persistent discovery
>> controller will need this event?
> 
> Well my aversion is to remove relationships between user-space and 
> kernel.  It seemed very odd to go to user-space to then set a custom 
> option and this handshaking looked like a lot of work vs just generating 
> a reconnect event on a discovery controller.
> 
> In my mind think we should be doing one of the following:
> a) generate connect and reconnect events generically. If they are 
> handled, great, if not that's fine. Handler is free to filter by device 
> attributes as to what it does

I can understand this approach. A persistent discovery controller that
reconnects, needs to send an indication to userspace universally because
someone always needs to rediscover as a result.

What do others think? Martin?

> I assume there's some issue here that you saw with the original attempt 
> as reconnects may post while connects or reconnects are still being 
> handled.
> 
> b) go down the path of the rediscover option - but the kernel routine 
> should never require it to be a discovery controller. If the option is 
> set, send the rediscover event. Removes events that won't be handled.
> The filtering in the lib to only set the option on devices that the lib 
> cares about can continue.

The fact that the event is called rediscover semantically means that
this is the discovery controller. It would be weird coming on any other
controller, even if userspace happened to ask for it. If at all, I'd
prefer to fail the parameter for controllers that are not discovery
controller, which is strict, but at least not resulting in a weird
event...



More information about the Linux-nvme mailing list