NVMe over Fabrics host: behavior on presence of ANA Group in "change" state

Alex Talker alextalker at yandex.ru
Fri Feb 11 12:53:48 PST 2022


Thanks for taking the time to give such a detailed explanation! Now...

 > [FK> ] I'm not sure I understand that. Access state is always based
 > on the port, and ANA is totally about different access states on
 > different ports. If it was always the same on every port, then it
 > would be symmetric and there would be no need for ANA. The point of
 > the ANAGRPID is so the host can use a change of state reported for
 > one namespace to also recognize that an equivalent change has also
 > occurred for all other namespaces that have the same ANAGRPID.

I just meant that my setup is a little bit dumb.
In all previous messages I was talking in the context of only one
node ("installation"), but in the bigger picture it's actually a more
cluster-like configuration.
In my experience it is often the case that one namespace (i.e. the
underlying block device) needs separate attention at a given moment while
the others don't, and ports tend to disappear as a whole (for example, due
to a broken cable) rather than only some of the namespaces becoming
unavailable on one of them.
That is why I opted for such a group configuration.
I do understand that the standard aims for a more flexible approach, and
that's okay; it's just too advanced for my application of this functionality.
One can also assume that a namespace's NSID is global on such a system,
again for pure simplicity.
And I do set the NGUID to the same value across nodes (when it's possible to
have a shared block device present on all of them), so it's all fine and
dandy on that front.

 > [FK> ] It would be fine for a host to track each NSID individually,
 > but they are unique only to a single NVM subsystem (if your host is
 > connected to an NVM subsystem from vendor 1, and also to an NVM
 > subsystem from vendor 2, then an NSID on the first subsystem is a
 > DIFFERENT namespace than the same NSID on the other NVM subsystem).
 > Dispersed namespaces are a different topic for a different thread.
 > And how a host does groupings of namespaces and how the ANAGRPID is
 > defined in the spec are independent.

The last statement precisely explains all the rest, since, again, it's
just my own setup and my own choice of how to map things.
So, as I highlighted above, in my case an equal NGUID would likely yield
an equal NSID across different subsystems
(which might be set up in order to give different sets of resources to
different hosts, since the list of allowed hosts is set at the subsystem
level in the nvmet implementation).
I probably should have written a clearer explanation earlier, sorry for the
distraction.
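
To make the point about NSID scope concrete, here is a minimal sketch of a
host-side namespace key. The struct and function names are hypothetical and
not taken from the Linux driver; the only assumption is that an NSID is
meaningful only together with the NVM subsystem it belongs to (identified
here by its NQN).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side key: an NSID only identifies a namespace
 * within one NVM subsystem, so the subsystem NQN is part of the key. */
struct ns_key {
    const char *subsysnqn;  /* NQN of the owning NVM subsystem */
    uint32_t nsid;          /* namespace ID, scoped to that subsystem */
};

/* Two keys refer to the same namespace only if both parts match. */
static bool ns_key_equal(const struct ns_key *a, const struct ns_key *b)
{
    assert(a && b && a->subsysnqn && b->subsysnqn);
    return a->nsid == b->nsid && strcmp(a->subsysnqn, b->subsysnqn) == 0;
}
```

With this key, NSID 1 on one subsystem and NSID 1 on another compare as
different namespaces, even when a particular setup happens to assign the
same NSID on both.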


 > Right now if a namespace changes its ANAGRPID, there is 1 AEN
 > required - for the ANA Log page contents changed (the NAMESPACE data
 > changed AEN is prohibited for this case). But, if the ANA changes in
 > the log page cause any groups to enter CHANGE state, then all
 > namespaces in that ANA Group are in the CHANGE state - not just the 1
 > namespace for which the ANAGRPID value changed. So storage that can
 > instantaneously change the ANAGRPID, the change is just about
 > inventory. But, for storage that takes time to move things around,
 > the whole "source" ANA group may enter CHANGE state (AEN), so the one
 > NSID can be removed (maybe another AEN), then the "destination" ANA
 > group enters CHANGE state (maybe another AEN), the "source" ANA group
 > can go out of CHANGE state (maybe another AEN), the "destination" ANA
 > group has the NSID added (maybe another AEN), and that "destination"
 > ANA group can go out of CHANGE state (maybe another AEN) - that means
 > stopping all commands to all the namespaces in both groups at some
 > point during that "move" process. How many changes happen (vs. how
 > many steps are combined), and how many AENs happen depends on how
 > long it takes, how many steps are merged vs. independent, and how the
 > host responds during that process. But no matter how it progresses,
 > that process is ugly, and something we wanted to optimize (via
 > TP4108). We hoped to create a way to optimize that transition.

So, did I get it right that it is advised to put ANA groups into the
"change" state when changing the ANAGRPID (in the sense of a namespace
attribute)?
Or did I completely lose the plot?
In any case, I sincerely hope that whatever is going on in this document,
which I have no way to access, is for the best!
I suppose I do get the basics of ANA groups, though (in that the state
changes for all group members at once), but thanks for the explanation
anyway.
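
The "state changes for all group members at once" behavior can be sketched
as follows. This is an illustration only: struct ns_path and
ana_apply_group_state() are hypothetical names, not the kernel's data
structures. The state values are the ones defined for ANA in the NVMe Base
Specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ANA state values as defined by the NVMe Base Specification. */
enum ana_state {
    ANA_OPTIMIZED       = 0x1,
    ANA_NONOPTIMIZED    = 0x2,
    ANA_INACCESSIBLE    = 0x3,
    ANA_PERSISTENT_LOSS = 0x4,
    ANA_CHANGE          = 0xf,
};

/* Hypothetical per-path view of a namespace kept by a host. */
struct ns_path {
    uint32_t nsid;
    uint32_t anagrpid;
    enum ana_state state;
};

/* Apply a group's new state to every namespace carrying that ANAGRPID.
 * Returns how many namespaces were updated (0 for an empty group). */
static size_t ana_apply_group_state(struct ns_path *ns, size_t n,
                                    uint32_t anagrpid, enum ana_state state)
{
    size_t updated = 0;

    assert(ns != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (ns[i].anagrpid == anagrpid) {
            ns[i].state = state;
            updated++;
        }
    }
    return updated;
}
```

A return value of 0 corresponds to the empty-group case discussed further
down: the group changed state, but no namespace on the host was affected.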

 > As for a group with zero attached namespaces - a host that uses RGO=0
 > will not get any state information about that group (it will simply
 > NOT be returned in the log page). If however, the host uses RGO=1,
 > then the host gets back a list of all groups and their states (and
 > there aren't ANY NSID values returned at all); meaning, there is no
 > way to determine from that data alone if there are any attached
 > namespaces or not. The point of RGO=1 is to be able to update the
 > state of the groups without having to parse all the NSID information
 > (just so it can be ignored).

Now I once again learned something new! So I get that RGO is an
optimization, which is nice.
However, the piece of code I'm having problems with in this
implementation (nvmet.ko) seems to opt for RGO=0,
but I'm not completely sure. I came to this conclusion based on the fact
that nnsids is checked
within the function I'm trying to patch (nvme_update_ana_state), and it
clearly comes from the log.
Someone more familiar with the code base might be able to say
whether RGO=1 is ever used or whether it depends.
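
For reference, a minimal sketch of walking an ANA log page buffer is given
below. It assumes the layout from the NVMe Base Specification: a 16-byte
header with the number of group descriptors at byte 8, followed by
descriptors that each have a 32-byte fixed part (number of NSID values at
byte 4, ANA state in the low nibble of byte 16) and then that many 4-byte
NSIDs. The function and struct names are hypothetical; this is not the
kernel's parser. With RGO=1 every descriptor reports zero NSID values, so
the same walk works for both settings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ANA_HDR_LEN      16u   /* change count, descriptor count, reserved */
#define ANA_DESC_LEN     32u   /* fixed part of one group descriptor */
#define ANA_STATE_CHANGE 0x0f

struct ana_summary {
    unsigned groups;         /* descriptors successfully parsed */
    unsigned change_groups;  /* groups reporting the change state */
    unsigned change_empty;   /* change-state groups listing zero NSIDs */
};

static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | p[1] << 8);
}

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Returns 0 on success, -1 if the buffer is too short for its contents. */
static int ana_log_summarize(const uint8_t *buf, size_t len,
                             struct ana_summary *out)
{
    size_t off = ANA_HDR_LEN;
    uint16_t ngrps;

    assert(buf != NULL && out != NULL);
    out->groups = out->change_groups = out->change_empty = 0;
    if (len < ANA_HDR_LEN)
        return -1;
    ngrps = le16(buf + 8);

    for (uint16_t i = 0; i < ngrps; i++) {
        if (len - off < ANA_DESC_LEN)
            return -1;
        uint32_t nnsids = le32(buf + off + 4);
        uint8_t state = buf[off + 16] & 0x0f;

        off += ANA_DESC_LEN;
        if ((len - off) / 4 < nnsids)
            return -1;
        off += (size_t)nnsids * 4;  /* skip the NSID list */

        out->groups++;
        if (state == ANA_STATE_CHANGE) {
            out->change_groups++;
            if (nnsids == 0)
                out->change_empty++;
        }
    }
    return 0;
}
```

The change_empty counter marks exactly the case this thread is about: a
group in the change state whose descriptor lists no namespaces.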

 > SO, what should happen for an ANA GROUP that has no namespaces when
 > that group enters CHANGE state. I don't see why it should be any
 > different than any other group. I'm not convinced a group with 0
 > namespaces is allowed to have any different behavior than a group
 > with 1 namespace attached. No group should remain in the CHANGE state
 > any longer than the ANATT timer value. However, when I read section
 > 8.10.4 Host ANA Change Notice operation (NVMe Base Spec 2.0), all the
 > recovery actions are described in the context of sending commands to
 > a namespace in the ANA Group, or the retries of commands being sent
 > to a namespace in the ANA Group. Obviously, that will never happen
 > for an ANA Group with no namespaces. EVEN the worst case scenario
 > says: "If the ANATT time interval expires, then the host should use a
 > different controller for sending commands to the namespaces in that
 > ANA Group." It's still about commands sent to namespaces. NOWHERE
 > does that text suggest a reset. If an ANATT timeout occurs - it says
 > pick a different path for sending commands to the namespaces in that
 > group (which is obviously a no-op when the group has no namespaces).

Why exactly the ANATT timer's function (nvme_anatt_timeout) opts for a
reset is unclear to me from the commit description, to be honest.
The rest matches my observations too.


 > So if the timer is not started (because there are 0 namespaces
 > attached) - and a namespace does come along (added to an ANA group
 > that is still in the CHANGE state), would the timer start when the
 > first command is sent to that namespace (and it fails with the
 > Asymmetric Access Transition)? That seems fine.

This is precisely what I'm aiming at with my in-progress patch. In this
thread I just wanted to discuss its sanity prior to publishing, explain
the situation that causes the problem in the first place, and get other
ideas along the way.
I'll double-check, but as far as I remember it worked fine with the patch.
So, just to be sure, you do agree with my proposal that there's no
point in starting the timer before at least one namespace becomes a
member of such a group?
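
The proposed rule can be sketched in isolation like this. It is a
standalone illustration under stated assumptions, not the actual patch and
not kernel code: struct ana_group and anatt_should_arm() are hypothetical
names, and the group list stands in for the descriptors read from the ANA
log page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ANA_STATE_CHANGE 0x0f

/* Hypothetical summary of one ANA group descriptor. */
struct ana_group {
    uint32_t grpid;
    uint32_t nnsids;  /* number of namespaces listed for the group */
    uint8_t state;    /* ANA state reported for the group */
};

/* Proposed rule: arm the ANATT timer only if some group is in the
 * change state AND has at least one namespace. Empty groups in the
 * change state are ignored, since no command can be sent to them. */
static bool anatt_should_arm(const struct ana_group *groups, size_t n)
{
    assert(groups != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (groups[i].state == ANA_STATE_CHANGE && groups[i].nnsids > 0)
            return true;
    }
    return false;
}
```

Under this rule an empty group parked in the change state never arms the
timer; once a namespace is added to it and the log page is re-read, the
same check returns true.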

Your overall knowledge is much appreciated!


Best regards,
Alex



