NVMe over Fabrics host: behavior on presence of ANA Group in "change" state

Fri Feb 11 08:58:05 PST 2022

> 
> 
> > [FK> ] I think I'm missing a bunch of context here. What is the original
> > question? I take a stab at some assumptions: What is an empty ANA
> > group? That is an ANA Group with NO NSIDs associated with that group.
> > Meaning the "Number of NSID Values" field is cleared to '0h' in the ANA
> > Group Descriptor. That descriptor can be used to update some host
> > internal state information related to that ANA group, but it has no impact
> > on any I/O because there can be no I/O (since there are no NSID values).
> > So I'm not sure where that is going (because RGO=1 also can return ANA
> > Groups that have state, but no attached namespaces (it's a way to get
> > group state without any NSID inventory requirements)).
> 
> That's exactly right, the "nnsids=0" case. I/O is not a problem for such a
> group, for sure.
> I suppose the main argument we're having here is that when such a group
> has a "change" ANA state, the host (the "nvme-core" module) starts a timer
> for ANATT which, upon expiration, resets the controller.
> Now, I do not disagree that having such a group is "ugly", but rather argue
> that the ANATT-related functionality could be invoked only for the
> "nnsids>0" case, since only then is there a relation between the "change"
> state and a namespace via "ANAGRPID".
> 
> My approach to assigning ANA groups to namespaces rests on the idea that
> on one node (i.e. "system") a namespace usually has the same state on
> every port, since it is more likely that the access state of the namespace
> changes than the port it is accessed through. So I simply pre-allocate
> 5 ANA groups on each port, one per currently defined ANA state, and then
> change the "ANAGRPID" of a namespace to transition it from one state to
> another.

[FK> ] I'm not sure I understand that.  Access state is always based on the port, and ANA is totally about different access states on different ports.  If it was always the same on every port, then it would be symmetric and there would be no need for ANA.  The point of the ANAGRPID is so the host can use a change of state reported for one namespace to also recognize that an equivalent change has also occurred for all other namespaces that have the same ANAGRPID.

> While it is perfectly possible, as highlighted earlier, to transition while
> bypassing the "change" state, that state is still preferable in my opinion in
> situations when the final state is not known "a priori", and thus works as a
> graceful guard against the host's I/O. This is why I opt to pre-allocate a
> group for this state too. However, on modern versions of popular
> distributions that causes the reset issue described before, which might
> have an undetermined impact on my I/O in progress.
> 
> Thus, I find starting the ANATT timer redundant when "nnsids=0".
> I think the only users such a change might affect are those who use this as
> a dirty hack to reset the controller on the host (when would that be helpful,
> though?).
> Otherwise, I have prepared & checked on the mainline a simple (+2 lines,
> -2 lines) patch that fixes this behavior, so I can send it if it's preferable to
> have this discussion around an actual change.
> 
> > Now this treads into the TP 4108 space. There is currently no way to
> > report anything that impacts "only one namespace at a time". ANY report
> > of a change (AEN) for any namespace is always reporting a state change for
> > the entire group that contains the namespace where the event occurred.
> > That is the WHOLE POINT of ANA Groups. AND, that is the whole point of
> > TP4108 - to address that kind of situation (where a change impacts only 1
> > namespace). Until TP4108 addresses this situation, a single namespace
> > changing the ANAGRPID is ugly. Maybe we should get to work on that TP.
> 
> I'm not a member of any committee (unfortunately), so I have no idea what
> TP 4108 is about or where to find it.
> But my main point in that passage was not how little data would be
> exchanged between target & hosts, but rather for how many namespaces
> the associated ANA state would change, to highlight the contrast between
> changing the ANA state of a group and changing the ANAGRPID of a
> namespace.
> Again, I do not disagree that it's ugly. As for why I can't just go and assign
> each namespace (assuming NSIDs are global on my target system rather
> than per subsystem) a separate ANA Group: there is an 8 times difference
> between the allowed number of namespaces and of ANA groups. I proposed
> to parametrize that in a previous message but unfortunately got no reply in
> that regard.

[FK> ] It would be fine for a host to track each NSID individually, but they are unique only to a single NVM subsystem (if your host is connected to an NVM subsystem from vendor 1, and also to an NVM subsystem from vendor 2, then an NSID on the first subsystem is a DIFFERENT namespace than the same NSID on the other NVM subsystem).  Dispersed namespaces are a different topic for a different thread.  And how a host does groupings of namespaces and how the ANAGRPID is defined in the spec are independent.

Right now if a namespace changes its ANAGRPID, there is 1 AEN required - for the ANA Log page contents changed (the NAMESPACE data changed AEN is prohibited for this case). But, if the ANA changes in the log page cause any groups to enter CHANGE state, then all namespaces in that ANA Group are in the CHANGE state - not just the 1 namespace for which the ANAGRPID value changed. So for storage that can instantaneously change the ANAGRPID, the change is just about inventory.  But, for storage that takes time to move things around, the whole "source" ANA group may enter CHANGE state (AEN), so the one NSID can be removed (maybe another AEN), then the "destination" ANA group enters CHANGE state (maybe another AEN), the "source" ANA group can go out of CHANGE state (maybe another AEN), the "destination" ANA group has the NSID added (maybe another AEN), and that "destination" ANA group can go out of CHANGE state (maybe another AEN) - that means stopping all commands to all the namespaces in both groups at some point during that "move" process.  How many changes happen (vs. how many steps are combined), and how many AENs happen depends on how long it takes, how many steps are merged vs. independent, and how the host responds during that process.  But no matter how it progresses, that process is ugly, and something we wanted to optimize (via TP4108).  We hoped to create a way to optimize that transition.
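
To make the cost of that slow "move" concrete, here is a rough Python model of the six steps listed above. It is only a sketch: the group IDs, NSIDs, state strings and the function names (blocked_nsids, staged_move) are made up for illustration, and a real target may merge or reorder steps as noted.

```python
# Hypothetical model: ANA groups as {grpid: {"state": str, "nsids": set}}.

def blocked_nsids(groups):
    """NSIDs whose group is currently in CHANGE state (host holds I/O)."""
    blocked = set()
    for grp in groups.values():
        if grp["state"] == "change":
            blocked |= grp["nsids"]
    return blocked


def staged_move(groups, nsid, src, dst):
    """Yield (step, blocked NSIDs) for a slow move of nsid from src to dst."""
    src_state, dst_state = groups[src]["state"], groups[dst]["state"]
    groups[src]["state"] = "change"
    yield "source enters CHANGE", blocked_nsids(groups)
    groups[src]["nsids"].discard(nsid)
    yield "NSID removed from source", blocked_nsids(groups)
    groups[dst]["state"] = "change"
    yield "destination enters CHANGE", blocked_nsids(groups)
    groups[src]["state"] = src_state
    yield "source leaves CHANGE", blocked_nsids(groups)
    groups[dst]["nsids"].add(nsid)
    yield "NSID added to destination", blocked_nsids(groups)
    groups[dst]["state"] = dst_state
    yield "destination leaves CHANGE", blocked_nsids(groups)


# Illustrative inventory: NSID 3 moves from group 1 to group 2.
groups = {
    1: {"state": "optimized", "nsids": {1, 2, 3}},
    2: {"state": "non-optimized", "nsids": {4, 5}},
}
steps = list(staged_move(groups, nsid=3, src=1, dst=2))
for name, blocked in steps:
    print(f"{name:28} blocked NSIDs: {sorted(blocked)}")
```

In this toy run every namespace in both groups is blocked at some point, even though only NSID 3 actually moved.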

As for a group with zero attached namespaces - a host that uses RGO=0 will not get any state information about that group (it will simply NOT be returned in the log page).  If however, the host uses RGO=1, then the host gets back a list of all groups and their states (and there aren't ANY NSID values returned at all); meaning, there is no way to determine from that data alone if there are any attached namespaces or not.  The point of RGO=1 is to be able to update the state of the groups without having to parse all the NSID information (just so it can be ignored).
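
A small Python sketch of the RGO difference as described in the paragraph above (not taken from the spec text or any driver). AnaGroupDesc and get_ana_log are hypothetical names, and the descriptor is reduced to group ID, state and NSID list.

```python
from dataclasses import dataclass, field


@dataclass
class AnaGroupDesc:
    grpid: int
    state: str
    nsids: list = field(default_factory=list)

    @property
    def nnsids(self):
        # Stands in for the "Number of NSID Values" field of the descriptor.
        return len(self.nsids)


def get_ana_log(groups, rgo):
    """Build the descriptor list a controller would return (toy model)."""
    if rgo:
        # RGO=1: every group with its state, but no NSID values at all.
        return [AnaGroupDesc(g.grpid, g.state) for g in groups]
    # RGO=0: as described above, groups without namespaces are not returned.
    return [AnaGroupDesc(g.grpid, g.state, list(g.nsids))
            for g in groups if g.nsids]


# Illustrative target state: group 2 is empty and in CHANGE state.
target = [
    AnaGroupDesc(1, "optimized", [1, 2]),
    AnaGroupDesc(2, "change"),
]
full = get_ana_log(target, rgo=False)
groups_only = get_ana_log(target, rgo=True)
```

With RGO=1 both descriptors come back with nnsids equal to 0, so the host cannot tell from that data alone which group is really empty.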

SO, what should happen for an ANA GROUP that has no namespaces when that group enters CHANGE state?  I don't see why it should be any different than any other group.  I'm not convinced a group with 0 namespaces is allowed to have any different behavior than a group with 1 namespace attached. No group should remain in the CHANGE state any longer than the ANATT timer value.  However, when I read section 8.10.4 Host ANA Change Notice operation (NVMe Base Spec 2.0), all the recovery actions are described in the context of sending commands to a namespace in the ANA Group, or the retries of commands being sent to a namespace in the ANA Group.  Obviously, that will never happen for an ANA Group with no namespaces.  EVEN the worst case scenario says: "If the ANATT time interval expires, then the host should use a different controller for sending commands to the namespaces in that ANA Group."  It's still about commands sent to namespaces.  NOWHERE does that text suggest a reset.  If an ANATT timeout occurs - it says pick a different path for sending commands to the namespaces in that group (which is obviously a no-op when the group has no namespaces).
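
As a sketch of that reading of the quoted sentence: on ANATT expiry the host only re-routes the namespaces of the affected group, and an empty group yields nothing to do. The function name on_anatt_expired and the controller labels are hypothetical, and picking the first alternate controller is a simplification, not a path-selection policy.

```python
def on_anatt_expired(group_nsids, current_ctrl, all_ctrls):
    """Return {nsid: controller} re-routing for a group stuck in CHANGE.

    Models only what the quoted text describes: use a different controller
    for the namespaces in the group. No controller reset is modelled.
    """
    alternates = [c for c in all_ctrls if c != current_ctrl]
    if not alternates:
        return {}  # nowhere else to send commands
    return {nsid: alternates[0] for nsid in group_nsids}


ctrls = ["ctrl-a", "ctrl-b"]
reroute_populated = on_anatt_expired([7, 8], "ctrl-a", ctrls)
reroute_empty = on_anatt_expired([], "ctrl-a", ctrls)
print(reroute_populated)
print(reroute_empty)
```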

So if the timer is not started (because there are 0 namespaces attached) - and a namespace does come along (added to an ANA group that is still in the CHANGE state), would the timer start when the first command is sent to that namespace (and it fails with the Asymmetric Access Transition)?  That seems fine.
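
A minimal Python sketch of that idea, assuming the host keeps one ANATT deadline per group: the timer is not armed for an empty CHANGE group, and is armed later when the first command to a namespace in that group fails with Asymmetric Access Transition. AnattTracker, its method names and the ANA_TRANSITION constant are all hypothetical and do not correspond to the nvme-core code.

```python
import time

ANA_TRANSITION = "asymmetric-access-transition"  # symbolic status, not a real code


class AnattTracker:
    def __init__(self, anatt_seconds, clock=time.monotonic):
        self.anatt = anatt_seconds
        self.clock = clock
        self.deadline = {}  # grpid -> absolute expiry time

    def _arm(self, grpid):
        # Keep an existing deadline; only the first trigger starts the timer.
        self.deadline.setdefault(grpid, self.clock() + self.anatt)

    def on_change_reported(self, grpid, nnsids):
        # Assumption under discussion: an empty group does not start the timer.
        if nnsids > 0:
            self._arm(grpid)

    def on_command_failed(self, grpid, status):
        # A namespace showed up in a group that is still in CHANGE state.
        if status == ANA_TRANSITION:
            self._arm(grpid)

    def on_change_cleared(self, grpid):
        self.deadline.pop(grpid, None)

    def expired(self):
        now = self.clock()
        return sorted(g for g, t in self.deadline.items() if now >= t)


# Fake clock so the example does not need to sleep.
now = [0.0]
tracker = AnattTracker(anatt_seconds=10, clock=lambda: now[0])
tracker.on_change_reported(grpid=5, nnsids=0)   # empty group: no timer
now[0] = 100.0
tracker.on_command_failed(grpid=5, status=ANA_TRANSITION)  # timer starts here
```

In this model the deadline for group 5 is 110.0 rather than 10.0, i.e. ANATT is measured from the first failed command, not from the original log page update.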

> 
> Hope that more or less cleared things up.
> 
> Thanks for your time!
> 
> Best regards,
> Alex