[PATCH v2] nvmet: force reconnect when number of queue changes

Daniel Wagner dwagner at suse.de
Wed Sep 28 02:02:43 PDT 2022


On Wed, Sep 28, 2022 at 11:31:43AM +0300, Sagi Grimberg wrote:
> 
> > > > In order to be able to test queue number changes we need to make sure
> > > > that the host reconnects.
> > > >
> > > > The initial idea was to disable and re-enable the ports and have the
> > > > host wait until the KATO timer expires and enter error
> > > > recovery. But in this scenario the host could see DNR on a connection
> > > > attempt, which results in the host dropping the connection completely.
> > > >
> > > > We can force the host to reconnect by deleting all controllers
> > > > connected to the subsystem, which results in the host observing a
> > > > failing command and trying to reconnect.
> > > 
> > > This looks like a change that attempts to fix a host issue from the
> > > target side... Why do we want to do that?
> > 
> > It's not a host issue at all. The scenario I'd like to test is when the
> > target changes this property while the host is connected (e.g. a
> > software update results in a new configuration). I haven't found a way
> > to signal the host to reset/reconnect from the target. Hannes suggested
> > deleting all controllers from the given subsystem, which triggers the
> > recovery process on the host on the next request. This makes the test
> > work.
> 
> But that is exactly like doing:
> - remove subsystem from port
> - apply q count change
> - link subsystem to port
> 
> Your problem is that the target returns an error code that makes the
> host never reconnect. That is a host behavior, and that behavior is
> different for each transport used.

Yes, I am trying to avoid triggering the DNR.
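
For context, what the patch does is tear down every controller that is
currently connected to the subsystem, so the host observes a failing
command (rather than DNR on a connect attempt) and goes through its
normal error recovery/reconnect path. Roughly along these lines -- a
sketch only, not the actual patch; the helper name and the locking are
my assumptions, while struct nvmet_subsys/nvmet_ctrl and
ops->delete_ctrl() are the existing target structures:

   /*
    * Sketch: force connected hosts to reconnect by deleting every
    * controller attached to the subsystem. Helper name and locking are
    * assumptions, not taken from the actual patch.
    */
   static void nvmet_subsys_del_ctrls(struct nvmet_subsys *subsys)
   {
           struct nvmet_ctrl *ctrl;

           mutex_lock(&subsys->lock);
           list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
                   ctrl->ops->delete_ctrl(ctrl);
           mutex_unlock(&subsys->lock);
   }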

> This is why I'm not clear on whether this is the right place to
> address this issue.
> 
> I personally do not understand why a DNR completion makes the host
> choose to not reconnect. DNR means "do not retry" for the command
> itself (which the host adheres to), and it does not carry any meaning
> for the reset/reconnect logic.

I am just the messenger here. Besides Hannes' objection in the last mail
thread, I got this private reply from Fred Knight:

   Do Not Retry (DNR): If set to ‘1’, indicates that if the same command is
   re-submitted to any controller in the NVM subsystem, then that
   re-submitted command is expected to fail. If cleared to ‘0’, indicates
   that the same command may succeed if retried. If a command is aborted
   due to time limited error recovery (refer to the Error Recovery section
   in the NVM Command Set Specification), this bit should be cleared to
   ‘0’. If the SCT and SC fields are cleared to 0h, then this bit should be
   cleared to ‘0’.

   It simply makes NO SENSE to retry that command. If the device wants the
   host to retry, then it will clear DNR=0.
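
(At the command level this is also what the host implements: a
completion with DNR set is completed with the error instead of being
retried. A simplified sketch, paraphrased from memory from the host
core's retry/disposition check -- not a verbatim copy:)

   /*
    * Simplified sketch of the host-side retry decision: a completion
    * with DNR set is not retried as a command; this function only
    * decides the fate of the individual request.
    */
   static enum nvme_disposition nvme_decide_disposition(struct request *req)
   {
           if (likely(nvme_req(req)->status == 0))
                   return COMPLETE;

           if (blk_noretry_request(req) ||
               (nvme_req(req)->status & NVME_SC_DNR) ||
               nvme_req(req)->retries >= nvme_max_retries)
                   return COMPLETE;

           return RETRY;
   }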

> In my mind, a possible use-case is that a subsystem can be un-exported
> from a port for maintenance reasons, relying on the host to
> periodically attempt to reconnect, and this is exactly what your test is
> doing.

Yes, and that's the intended test case. The queue number change comes on
top of this scenario; it's a combined test case.

> > Though if you have a better idea how to signal the host to reconfigure
> > itself, I am glad to work on it.
> 
> I think we should first agree on what the host should/shouldn't do and
> make the logic consistent between all transports. Then we can talk about
> how to write a test for your test case.

Fair enough. This was just my cheesy attempt to get things moving.


