NVMe over Fabrics host: behavior on presence of ANA Group in "change" state

Hannes Reinecke hare at suse.de
Mon Feb 7 01:46:08 PST 2022


On 2/6/22 14:59, Alex Talker wrote:
> Recently I noticed a peculiar error after connecting from the host
> (CentOS 8 Stream at the time, more on that below)
> via TCP (though the transport likely doesn't matter) to an NVMe
> target subsystem exported by the nvmet module:
> 
>  > ...
>  > nvme nvme1: ANATT timeout, resetting controller.
>  > nvme nvme1: creating 8 I/O queues.
>  > nvme nvme1: mapped 8/0/0 default/read/poll queues.
>  > ...
>  > nvme nvme1: ANATT timeout, resetting controller.
>  > ...(and it continues like that over and over again; on some
>  > configurations it even gets worse with each reconnect iteration)
> 
> I discovered that this behavior is caused by code in
> drivers/nvme/host/multipath.c, in particular the function
> nvme_update_ana_state, which increments the variable
> nr_change_groups whenever any ANA Group is in the "change" state,
> regardless of whether any namespace belongs to that group.
> After figuring out that ANATT stands for ANA Transition Time and
> reading some more of the NVMe 2.0 standard, I understood that the
> problem is caused by how I utilize ANA Groups.
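> 
> The relevant host-side logic (trimmed from
> drivers/nvme/host/multipath.c as of 5.15; comments mine) looks
> roughly like this:
> 
>     static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
>                     struct nvme_ana_group_desc *desc, void *data)
>     {
>             u32 nr_nsids = le32_to_cpu(desc->nnsids);
>             unsigned *nr_change_groups = data;
> 
>             /* counted for every group in "change", even one that
>              * contains no namespaces at all */
>             if (desc->state == NVME_ANA_CHANGE)
>                     (*nr_change_groups)++;
> 
>             if (!nr_nsids)
>                     return 0;
>             /* ...per-namespace state updates elided... */
>     }
> 
> In nvme_read_ana_log(), any non-zero count then (re)arms the ANATT
> timer, whose expiry handler prints the "ANATT timeout" message and
> resets the controller:
> 
>     if (nr_change_groups)
>             mod_timer(&ctrl->anatt_timer,
>                       ctrl->anatt * HZ * 2 + jiffies);
>     else
>             del_timer_sync(&ctrl->anatt_timer);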
> 
> As far as I remember, the permitted number of ANA Groups in the
> nvmet module is 128, while the maximum number of namespaces is
> 1024 (8 times more).
> Thus, mapping 1 namespace to 1 ANA Group works only up to a point.
> It is nice to have logically-related namespaces belong to the same
> ANA Group, and the final scheme of how namespaces are assigned to
> ANA Groups is often vendor-specific (or rather lies in the decision
> domain of the end user of the target).
> However, rather than changing the state of a namespace on a specific
> port (for maintenance reasons, for example), I find it particularly
> useful to utilize ANA Groups to change the state of a single
> namespace, since it is more likely that a block device enters an
> unusable state or becomes part of some transitioning process.
> Thus, the simplest scheme for me is to create a few ANA Groups on
> each port, one per possible ANA state, and reassign a namespace to a
> different ANA Group rather than change the state of the group the
> namespace currently belongs to; see the sketch below.
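> 
> As a sketch of that scheme (a small C stand-in for the usual shell
> writes; the port number, subsystem NQN "testnqn", namespace ID and
> group IDs are illustrative for my setup, and the extra ana_groups
> are assumed to have been created with mkdir beforehand):
> 
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <string.h>
>     #include <unistd.h>
> 
>     /* write a value to a configfs attribute, like "echo val > path" */
>     static int write_attr(const char *path, const char *val)
>     {
>             int fd = open(path, O_WRONLY);
>             ssize_t ret = -1;
> 
>             if (fd >= 0) {
>                     ret = write(fd, val, strlen(val));
>                     close(fd);
>             }
>             if (ret < 0)
>                     perror(path);
>             return ret < 0 ? -1 : 0;
>     }
> 
>     int main(void)
>     {
>             /* pin each group to one ANA state once, at setup time */
>             write_attr("/sys/kernel/config/nvmet/ports/1/"
>                        "ana_groups/2/ana_state", "inaccessible");
>             /* later: reassign the namespace to the group whose state
>              * it should have, instead of flipping the group's state */
>             return write_attr("/sys/kernel/config/nvmet/subsystems/"
>                               "testnqn/namespaces/1/ana_grpid", "2");
>     }
> 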
> And here's the catch.
> 
> If one creates a subsystem (no namespaces needed) on a port, connects
> to it, and then sets the state of ANA Group #1 to "change", the issue
> described at the beginning is reproducible on practically every major
> distro and even on upstream code.
> Sometimes it can be mitigated by disabling native multipath (setting
> /sys/module/nvme_core/parameters/multipath to N), but sometimes that
> doesn't help, which is why this issue is quite annoying for my setup.
> I just checked it on 5.15.16 from Manjaro (basically Arch Linux) and
> on ELRepo's kernel-ml and kernel-lt (basically vanilla builds of the
> mainline and LTS kernels, respectively, for CentOS).
> 
> The standard says:
> 
>  > An ANA Group may contain zero or more namespaces
> 
> which makes perfect sense, since one has to create a group prior to 
> assigning it to a namespace, and then:
> 
>  > While ANA Change state is reported by a controller for the namespace, 
> the host should: ...(part regarding ANATT)
> 
> So on one hand I think my setup might be questionable (I could
> allocate an ANAGRPID for "change" only during actual transitions,
> though that might over-complicate usage of the module); on the other
> hand, I think this might be a misinterpretation of the standard that
> needs some additional clarification.
> 
That's actually a misinterpretation.
The sentence above refers to a device reporting to be in ANA 'change',
i.e. after reading the ANA log and detecting that a given namespace is
in a group whose ANA state is 'change'.

In your case it might be feasible to not report 'change' at all, but
rather do a direct transition from one group to the other, i.e. simply
rewrite the namespace's ana_grpid from the old group to the new one.
If it's just a single namespace, the transition should be atomic, and
hence there won't be any synchronisation issues which might warrant a
'change' state.

> That's why I decided to compose this message before proposing any
> patches.
> 
> Also, while digging through the code, I noticed that ANATT is
> currently reported as an arbitrary constant (of 10 seconds), and
> since the transition time often differs depending on the block
> devices in use underneath the namespaces, it might be viable to
> allow the end user to change this value via configfs; a sketch of
> what that could look like follows.
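> 
> Purely as an illustration (this is hypothetical, not a patch; the
> attribute name and the subsystem field are made up, but the plumbing
> follows the existing nvmet_subsys_attr_* pattern in
> drivers/nvme/target/configfs.c), such a knob could look like:
> 
>     static ssize_t nvmet_subsys_attr_anatt_show(struct config_item *item,
>                     char *page)
>     {
>             return snprintf(page, PAGE_SIZE, "%u\n",
>                             to_subsys(item)->anatt);
>     }
> 
>     static ssize_t nvmet_subsys_attr_anatt_store(struct config_item *item,
>                     const char *page, size_t count)
>     {
>             u8 anatt;
> 
>             if (kstrtou8(page, 0, &anatt) || !anatt)
>                     return -EINVAL;
>             /* hypothetical field; the target would then report it
>              * instead of the fixed value set in
>              * nvmet_execute_identify_ctrl() */
>             to_subsys(item)->anatt = anatt;
>             return count;
>     }
>     CONFIGFS_ATTR(nvmet_subsys_, attr_anatt);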
> 
> Considering everything I wrote, I'd like to hear opinions on the
> following questions:
> 1. Is my utilization of ANA Groups a viable approach?

Well, it certainly is an odd one, but it should be doable.
Note, though, that there have been some fixes to the ANA group ID
handling; most recently commit 79f528afa939 ("nvme-multipath: fix ANA
state updates when a namespace is not present").
So do ensure you have the latest fixes to get the best possible user
experience.

> 2. Which ANA Group assignment schemes are used in production, in
> your experience?

Typically it's the NVMe controller port which holds the ANA state; in
most implementations I'm aware of you have one or more (physical) NVMe
controller ports, which host the interfaces etc.
They connect to the actual storage, and failover is done by switching
I/O between those controller ports.
Hence the ANA state is really a property of the controller port in
those implementations.

> 3. Should changing the ANATT value be allowed via configfs (in
> particular, at the per-subsystem level, I think)?
> 
The ANATT value is useful if you have an implementation which takes
some time to facilitate the switch-over. As there is always a chance
of the switch-over going wrong, the ANATT serves as an upper boundary
after which an ANA state of 'change' can be considered stale and a
re-read is in order.
So for the Linux implementation it's a bit moot; one would have to
present a use-case where changing the ANATT value would make a
difference.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare at suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer
