[PATCH rfc 0/1] Fix missing AENs when discovery controllers are disconnected

Mon Sep 12 17:06:48 PDT 2022

On 9/12/2022 5:39 AM, Sagi Grimberg wrote:
>>> When a discovery controller is disconnected, no AENs will
>>> arrive to notify the host about discovery log-page changes.
>>>
>>> This attempts to fix it with a new connection parameter asking
>>> the kernel to send a dedicated udev event for this case. Prior
>>> attempt tried to use "connected" event already sent by the kernel
>>> however this also applied on the first connected, causing undesired
>>> side-effects when issuing a simple 'discover' command.
>>>
>>> The patchset includes the nvme-cli/libnvme counter-parts as well.
>>
>> Sagi,
>>
>> Do we really need another parameter/option ?  seems awkward to need 
>> userspace to figure it out.
> 
> It has a rather specific use though..
> 
> I was looking for something more fine-grained because I initially
> used the generic "connected" uevent which had some negative side
> effects, so I decided against adding yet another generic "reconnected"
> event that nvme-cli would use, and may/may-not fit other use-cases.
> 
> I thought its better to have something that is as specific as possible
> like "send me an event only when a persistent discovery-controller
> reconnects" and I also used the name to indicate the semantic of
> the event, so in the future, a possible consumer of this event will
> know exactly what it is used for before opting to use it.
> 
>> The transport certainly knows the difference between 1st time connect 
>> and nth time re-connect - perhaps this should be a simple flag on the 
>> nvme ctrl struct set by the transport ?  Simple thing for rdma/tcp to 
>> set flag at end of successful xxx_create_ctrl(), while FC, given it 
>> does reconnects w/o a first successful connect needs a little more logic.
> 
> I used ctrl->nr_reconnects to decide it, which introduced the constraint
> that transports would clear it _after_ calling nvme_start_ctrl(). Open
> to better suggestions though...

I was thinking this rather than the new option... The bit would only be 
set *after* the first successful link-side connect, thus the initial 
nvme_start_ctrl would not send the reconnect event. Every reconnect 
thereafter would.

-- james

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2429b11eb9a8..43c8b6590164 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4814,6 +4814,10 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl)

  	nvme_enable_aen(ctrl);

+	if (nvme_discovery_ctrl(ctrl) &&
+	    test_bit(NVME_CTRL_FABRIC_CONNECTED, &ctrl->flags))
+		nvme_change_uevent(ctrl, "NVME_EVENT=rediscover");
+
  	if (ctrl->queue_count > 1) {
  		nvme_queue_scan(ctrl);
  		nvme_start_queues(ctrl);
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 127abaf9ba5d..ff1dd8f999b0 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2947,6 +2947,7 @@ nvme_fc_create_io_queues(struct nvme_fc_ctrl *ctrl)
  		goto out_delete_hw_queues;

  	ctrl->ioq_live = true;
+	set_bit(NVME_CTRL_FABRIC_CONNECTED, &ctrl->ctrl.flags);

  	return 0;

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index bdc0ff7ed9ab..9b4260d60516 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -356,6 +356,7 @@ struct nvme_ctrl {
  	unsigned long flags;
  #define NVME_CTRL_FAILFAST_EXPIRED	0
  #define NVME_CTRL_ADMIN_Q_STOPPED	1
+#define NVME_CTRL_FABRIC_CONNECTED	2
  	struct nvmf_ctrl_options *opts;