NVMe over Fabrics host: behavior on presence of ANA Group in "change" state
Alex Talker
alextalker at yandex.ru
Mon Feb 7 07:04:12 PST 2022
> I'm not exactly sure what you are trying to do, but it sounds
> wrong... ANA groups are supposed to be a logical unit that expresses
> controllers' access state to the associated namespaces that belong to
> the group.
I do agree that my setup might seem odd, but I doubt it contradicts your
statement much, since each group would represent the state of the
namespaces belonging to it. The difference is just that instead of
having a complex (or rather installation/deployment-dependent)
relationship between a namespace and an ANA group, I opted for a
balancing act between the flexibility of assigning a state per namespace
and having a constant set of ANA groups on each system.
In my view, it is a rather common situation that one namespace has
trouble while the others don't, in which case it is better for that
namespace to become unavailable on all ports at once than for a certain
port to deny access to certain namespaces for, say, maintenance reasons.
> That is an abuse of ANA groups IMO. But OK...
I do not disagree, but the standard seems to permit it.
But let me try to explain my perspective through an analogy that may be
more familiar to you.
As you are probably aware, with ALUA in SCSI, via the Target Port Groups
mechanism, one can without any worry specify a certain (ALUA) state for
a LUN on a set of targets (at least in the SCST implementation).
I am not sure about the exact limitations, but I think it is quite easy
there to keep a 1 LUN = 1 group ratio for flexible control.
However, as I highlighted in an earlier message, the nvmet
implementation allows only 128 ANA Groups, while each (!) subsystem may
hold up to 1024 namespaces.
Thus, even if I had no issue with, say, assigning a group per namespace
(assuming that NSIDs are globally unique on my target), that is
currently not possible, so I am trying to do my best within these
restrictions while keeping the ANA Group setup as straightforward as
possible.
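(For reference, a quick way to check those limits in the mainline tree;
the constant names and values below are my understanding at the time of
writing:)

  grep -E 'NVMET_MAX_(ANAGRPS|NAMESPACES)' drivers/nvme/target/nvmet.h
  #   #define NVMET_MAX_ANAGRPS    128
  #   #define NVMET_MAX_NAMESPACES 1024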
One may argue that I should dump everything into one ANA Group, but
that would contradict my expectation of high availability for the
namespaces that are still (mostly?) working while others aren't.
One may also argue that production setups rarely exceed 128 namespaces
in total, but I would still prefer to support 1024 anyway.
Hope I cleared that one up; do feel free to correct me if I have a flaw
somewhere.
> This state is not a permanent state; it is transient by definition,
> which is why the host is treating it as such.
>
> The host is expecting the controller to send another ANA AEN that
> notifies the new state within ANATT (i.e. stateA -> change ->
> stateB).
As mentioned by Hannes, and I agree, the state is indeed transient, but
only in relation to a namespace, so I see no issue in having a group in
the "change" state with 0 namespaces as its members.
I understand that it would be nice and dandy to change the state of
multiple namespaces at once (if one takes the time to configure such a
dependency between them), but for the moment I opt for a simpler yet
flexible solution, maybe at the cost of a greater number of ANA log
changes in the worst-case scenario.
Thus, the cycle "namespace in state A" => "namespace in change state"
=> "namespace in state B" is still preserved, though by different means
(a change of the namespace's group rather than of the group's state).
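To make that concrete, here is a minimal configfs sketch of the scheme
on the target side (the port number, subsystem NQN and group IDs below
are hypothetical; one long-lived group is kept per possible state):

  cd /sys/kernel/config/nvmet
  # Group 1 exists by default on each port; further groups are
  # created on demand, one per ANA state:
  mkdir ports/1/ana_groups/2 ports/1/ana_groups/3
  echo optimized    > ports/1/ana_groups/1/ana_state
  echo change       > ports/1/ana_groups/2/ana_state
  echo inaccessible > ports/1/ana_groups/3/ana_state
  # "State A" => "change" => "state B" then becomes a matter of
  # moving the namespace between groups:
  echo 2 > subsystems/testnqn/namespaces/1/ana_grpid   # A -> change
  echo 3 > subsystems/testnqn/namespaces/1/ana_grpid   # change -> B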
> That is simply removing support for multipathing altogether.
You're not wrong on that one, though, no offense, in certain
configurations or with certain initiators that's the way to go,
especially when it might be a matter of swapping one implementation for
another (i.e. good old dm-multipath).
I mainly mentioned it because it works around the issue on some kernels
(including mainline/LTS) but not on others, which is why I think it is
important that this misinterpretation of the standard be accounted for
in the mainstream code: I can't possibly patch every single back-ported
kernel out there (I am personally looking at the CentOS world right
now), while any of them might be the end user of my target setups.
My territory is mainly the target, and this is not an issue I can fix
on my side.
Besides, the handling of my case already differs from the standard way
right now anyway.
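(For completeness, my understanding of how that workaround is applied;
note that the parameter is read-only at runtime, so the modules have to
be reloaded, which of course only works while no NVMe device is in use:)

  cat /sys/module/nvme_core/parameters/multipath   # current setting
  modprobe -r nvme_tcp nvme_fabrics nvme_core      # fails if devices are busy
  modprobe nvme_core multipath=N
  # or persistently, on the kernel command line: nvme_core.multipath=N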
> I still don't fully understand what you are trying to do, but
> creating a transient ANA group for a change state sounds backwards to
> me.
As I stated, I am just trying to work within the present limitations,
which I suppose were chosen with performance or something similar in
mind.
> Could be... We'll need to see patches.
On that note, I have seen plenty of git-related mails around here, so
would it be possible to publish the patches as a few commits on top of
the mainline or infradead git repo, on GitHub or somewhere similar?
Or is it mandatory to go, no offense, the old-fashioned way of sending
patch files as attachments or inline text?
I work with git 99.9% of the time, and the former would be easier for me.
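(In case the old-fashioned way it is, my understanding of the usual
flow; the output directory name below is arbitrary:)

  git format-patch -o outgoing/ origin/master   # one mail-ready patch per commit
  git send-email --to=linux-nvme@lists.infradead.org outgoing/*.patch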
Best regards,
Alex
On 07.02.2022 14:07, Sagi Grimberg wrote:
>
>
> On 2/6/22 15:59, Alex Talker wrote:
>> Recently I noticed a peculiar error after connecting from the host
>> (CentOS 8 Stream at the time, more on that below) via TCP (unlikely
>> to matter) to an NVMe target subsystem shared using the nvmet module:
>>
>>> ...
>>> nvme nvme1: ANATT timeout, resetting controller.
>>> nvme nvme1: creating 8 I/O queues.
>>> nvme nvme1: mapped 8/0/0 default/read/poll queues.
>>> ...
>>> nvme nvme1: ANATT timeout, resetting controller.
>>> ... (and it continues like that over and over again, on some
>>> configurations even getting worse with each reconnect iteration)
>>
>> I discovered that this behavior is caused by code in
>> drivers/nvme/host/multipath.c, in particular the function
>> nvme_update_ana_state, which increments the variable nr_change_groups
>> whenever any ANA Group is in "change" state, regardless of whether
>> any namespace belongs to the group or not. After figuring out that
>> ANATT stands for ANA Transition Time and reading some more of the
>> NVMe 2.0 standard, I understood that the problem is caused by how I
>> managed to utilize ANA Groups.
>>
>> As far as I remember, the permitted number of ANA Groups in the
>> nvmet module is 128, while the maximum number of namespaces is 1024
>> (8 times more). Thus, mapping 1 namespace to 1 ANA Group works only
>> up to a point. It is nice to have some logically-related namespaces
>> belong to the same ANA Group, and the final scheme of how namespaces
>> belong to ANA Groups is often vendor-specific (or rather lies in the
>> decision domain of the end user of target-related stuff). However,
>> rather than changing the state of a namespace on a specific port,
>> for example for maintenance reasons, I find it particularly useful
>> to utilize ANA Groups to change the state of a certain namespace,
>> since it is more likely that a block device might enter an unusable
>> state or be part of some transitioning process.
>
> I'm not exactly sure what you are trying to do, but it sounds
> wrong... ANA groups are supposed to be a logical unit that expresses
> controllers' access state to the associated namespaces that belong to
> the group.
>
>> Thus, the simplest scheme for me is to assign a few ANA Groups on
>> each port, one per possible ANA state, and to change the ANA Group
>> of a namespace rather than the state of the group the namespace
>> belongs to at the moment.
>
> That is an abuse of ANA groups IMO. But OK...
>
>> And here's the catch.
>>
>> If one creates a subsystem (no namespaces needed) on a port,
>> connects to it and then sets the state of ANA Group #1 to "change",
>> the issue introduced in the beginning can be reproduced on
>> practically all major distros and even on upstream code,
>
> This state is not a permanent state; it is transient by definition,
> which is why the host is treating it as such.
>
> The host is expecting the controller to send another ANA AEN that
> notifies the new state within ANATT (i.e. stateA -> change ->
> stateB).
>
>> though sometimes it can be mitigated by disabling the "native
>> multipath" (setting /sys/module/nvme_core/parameters/multipath to
>> N), but sometimes that's not the case, which is why this issue is
>> quite annoying for my setup.
>
> That is simply removing support for multipathing altogether.
>
>> I just checked it on 5.15.16 from Manjaro (basically Arch Linux) and
>> on ELRepo's kernel-ml and kernel-lt (essentially vanilla builds of
>> the mainline and LTS kernels, respectively, for CentOS).
>>
>> The standard says that:
>>
>>> An ANA Group may contain zero or more namespaces
>>
>> which makes perfect sense, since one has to create a group prior to
>> assigning it to a namespace, and then:
>>
>>> While ANA Change state is reported by a controller for the
>>> namespace, the host should: ...(part regarding ANATT)
>>
>> So on one hand I think my setup might be questionable (I could
>> allocate an ANAGRPID for "change" only during actual transitions,
>> though that might over-complicate usage of the module),
>
> I still don't fully understand what you are trying to do, but
> creating a transient ANA group for a change state sounds backwards to
> me.
>
>> on the other hand I think this is a misinterpretation of the
>> standard and might need some additional clarification.
>>
>> That's why I decided to compose this message first prior to
>> proposing any patches.
>>
>> Also, while digging through the code, I noticed that ANATT is at the
>> moment represented by an arbitrary constant (of 10 seconds), and
>> since the transition time often differs depending on the block
>> devices in use underneath the namespaces, it might be worthwhile to
>> allow the end user to change this value via configfs.
>
> How would you expose it via configfs? ANA groups may be shared across
> different ports IIRC. You would need to prevent conflicting
> settings...
>
>> Considering everything I wrote, I'd like to hear opinions on the
>> following questions:
>> 1. Is my utilization of ANA Groups a viable approach?
>
> I don't think so, but I don't know if I understood what you are
> trying to do.
>
>> 2. Which ANA Group assignment schemes are utilized in production, in
>> your experience?
>
> ANA groups will usually be used for what they are supposed to be: a
> group of zero or more namespaces where each controller may have a
> different access state to it (or to the namespaces assigned to it).
>
>> 3. Should changing the ANATT value be allowed via configfs (in
>> particular, on a per-subsystem level, I think)?
>
> Could be... We'll need to see patches.