[PATCH v2] nvmet: force reconnect when number of queue changes

Wed Sep 28 09:01:55 PDT 2022

> two targets (*)

Yes, that is what it would be called in SCSI (2 - SCSI Target Devices).  In NVMe, that would more likely be 1 NVM Subsystem with 2 controllers (which could be on the same port - although that would not really be redundant; or on different ports - which would be more likely).  There's also TP4034 with 2 NVM subsystems - but that is a completely different discussion.

There is an AEN for firmware updates - "Firmware Activation Starting", but that one requires just a "pause" in command processing, and not new connections (it is for a different use case than what you're talking about).

If you had static controllers, then the Discovery Controller would report those changes; so, if you had persistent connections to the discovery controller, you'd know when it went away and when it returned.

For dynamic controllers, it's a little harder: since they all get reported under the single ID of FFFFh, there is no change in what is reported when a dynamic controller goes away and then comes back on the same port.

If these controllers were on different ports, and the whole port goes away, then the Discovery Controller should be reporting the changes - the removal of a whole port and then the addition of a whole port.  But you'd still need a persistent connection to the Discovery Controller so it can send the AEN to notify the host about those changes.
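
A rough sketch of the three cases above, just to make the difference concrete. This is not real Discovery Log Page parsing: LogEntry, log_changed and the NQN are made up for illustration, and a real entry carries more fields (transport type, address, and so on). The only thing taken from the discussion is that dynamic controllers are all reported with controller ID FFFFh.

```python
from typing import NamedTuple

DYNAMIC_CNTLID = 0xFFFF  # controller ID reported for all dynamic controllers


class LogEntry(NamedTuple):
    """Simplified, hypothetical view of one discovery log entry."""
    subnqn: str
    portid: int
    cntlid: int


def log_changed(before: set, after: set) -> bool:
    """Return True if the set of reported entries differs between snapshots."""
    return before != after


NQN = "nqn.2022-09.example:subsys1"  # made-up subsystem NQN

# Static controllers: each controller has its own entry, so one
# controller going away shows up as a change in the log.
static_before = {LogEntry(NQN, 1, 0x0001), LogEntry(NQN, 1, 0x0002)}
static_during = {LogEntry(NQN, 1, 0x0002)}
static_changed = log_changed(static_before, static_during)

# Dynamic controllers on the same port: the log only carries the
# FFFFh entry, so a controller going away and coming back is invisible.
dynamic_before = {LogEntry(NQN, 1, DYNAMIC_CNTLID)}
dynamic_during = {LogEntry(NQN, 1, DYNAMIC_CNTLID)}
dynamic_changed = log_changed(dynamic_before, dynamic_during)

# Dynamic controllers on different ports: the whole of port 2 goes
# away, so its entry disappears and the log does change.
port_before = {LogEntry(NQN, 1, DYNAMIC_CNTLID), LogEntry(NQN, 2, DYNAMIC_CNTLID)}
port_during = {LogEntry(NQN, 1, DYNAMIC_CNTLID)}
port_changed = log_changed(port_before, port_during)

print(static_changed, dynamic_changed, port_changed)
```

Even in the cases where the log does change, the host only hears about it if it keeps a persistent connection to the Discovery Controller.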

I agree that adding some new feature just for a test case isn't really a good idea.  For one, it doesn't really test what would happen in the real world.

	Fred

> -----Original Message-----
> From: Daniel Wagner <dwagner at suse.de>
> Sent: Wednesday, September 28, 2022 9:51 AM
> To: Knight, Frederick <Frederick.Knight at netapp.com>
> Cc: Sagi Grimberg <sagi at grimberg.me>; linux-nvme at lists.infradead.org;
> Shinichiro Kawasaki <shinichiro.kawasaki at wdc.com>; hare at suse.de
> Subject: Re: [PATCH v2] nvmet: force reconnect when number of queue
> changes
> 
> On Wed, Sep 28, 2022 at 12:39:44PM +0000, Knight, Frederick wrote:
> > Would this be a case for a new AEN - controller configuration changed?
> > I'm also wondering exactly what changed in the controller?  It can't
> > be the "Number of Queues" feature (that can't change - The controller
> > shall not change the value allocated between resets.).  Is it the MQES
> > field in the CAP property that changes (queue size)?
> >
> > We already have change reporting for: Namespace attribute, Predictable
> > Latency, LBA status, EG aggregate, Zone descriptor, Discovery Log,
> > Reservations. We've questioned whether we need a Controller Attribute
> > Changed.
> >
> > Would this be a case for an exception?  Does the DNR bit apply only to
> > commands sent on queues that already exist (i.e., NOT the connect
> > command since that command is actually creating the queue)?  FWIW - I
> > don't like exceptions.
> >
> > Can you elaborate on exactly what is changing?
> 
> The background story is that we have a setup with two targets (*) and the
> host is connected to both of them (HA setup). Both servers run the same
> software version. The host is told via Number of Queues (Feature Identifier
> 07h) how many queues are supported (N queues).
> 
> Now, a software upgrade is started which first takes one server offline,
> updates it and brings it online again. Then the same process is repeated
> with the second server.
> 
> In the meantime the host tries to reconnect. Eventually, the reconnect is
> successful but the Number of Queues (Feature Identifier 07h) has changed to
> a smaller value, e.g. N-2 queues.
> 
> My test case here is trying to replicate this scenario but with just one target.
> Hence the problem of how to notify the host that it should reconnect. As you
> mentioned, this is not supposed to change as long as a connection is
> established.
> 
> My understanding is that the current nvme target implementation in Linux
> doesn't really support this HA setup scenario hence my attempt to get it
> flying with one target. The DNR bit comes into play because I was toying
> with removing the subsystem from the port, changing the number of queues
> and re-adding the subsystem to the port.
> 
> This resulted in any request posted from the host seeing the DNR bit. The
> second attempt here was to delete the controller to force a reconnect. I
> agree it's also not really the right thing to do.
> 
> As far as I can tell, what is missing from a testing point of view is the ability to
> fail requests without the DNR bit set or the ability to tell the host to
> reconnect. Obviously, an AEN would be nice for this but I don't know if this is
> reason enough to extend the spec.
> 
> I can't really say if this is a real world scenario or just a result of trying to cut
> corners. Anyway, I am glad to do the work if we can agree on how this test
> case could be implemented.
> 
> Daniel
> 
> (*) not sure what to call it properly, is this one target or two targets?