NVMe over Fabrics host: behavior on presence of ANA Group in "change" state

Sagi Grimberg sagi at grimberg.me
Mon Feb 7 03:07:30 PST 2022



On 2/6/22 15:59, Alex Talker wrote:
> Recently I noticed a peculiar error after connecting from the host
> (CentOS 8 Stream at the time, more on that below)
> via TCP (which is unlikely to matter) to an NVMe target subsystem shared 
> using the nvmet module:
> 
>  > ...
>  > nvme nvme1: ANATT timeout, resetting controller.
>  > nvme nvme1: creating 8 I/O queues.
>  > nvme nvme1: mapped 8/0/0 default/read/poll queues.
>  > ...
>  > nvme nvme1: ANATT timeout, resetting controller.
>  > ...(and it continues like that over and over and over again, on some 
> configuration even getting worse with greater iterations of reconnect)
> 
> I discovered that this behavior is caused by code in 
> drivers/nvme/host/multipath.c,
> in particular by the function nvme_update_ana_state, which increments the 
> variable nr_change_groups whenever any ANA Group is in the "change" state,
> regardless of whether any namespace belongs to the group or not.
> Now, after figuring out that ANATT stands for ANA Transition Time and 
> reading some more of the NVMe 2.0 standards, I understood that the 
> problem is caused by how I managed to utilize ANA Groups.
> 
> As far as I remember, the permitted number of ANA Groups in the nvmet 
> module is 128, while the maximum number of namespaces is 1024 (8 times more).
> Thus, mapping 1 namespace to 1 ANA Group works only up to a point.
> It is nice to have some logically related namespaces belong to the same 
> ANA Group,
> and the final scheme of how namespaces belong to ANA groups is often 
> vendor-specific
> (or rather lies in the decision domain of the end user of target-related 
> stuff).
> However, rather than changing the state of a namespace on a specific port, 
> for example for maintenance reasons,
> I find it particularly useful to utilize ANA Groups to change the state 
> of a certain namespace, since it is more likely that a block device might 
> enter an unusable state or be part of some transitioning process.

I'm not exactly sure what you are trying to do, but it sounds wrong...
ANA groups are supposed to be a logical unit that expresses a controller's
access state to the namespaces that belong to the group.

> Thus, the simplest scheme for me on each port is to assign a few ANA 
> Groups, one for each possible ANA state, and change the ANA Group of a 
> namespace rather than changing the state of the group the namespace belongs 
> to at the moment.

That is an abuse of ANA groups IMO. But OK...

> And here's the catch.
> 
> If one creates a subsystem (no namespaces needed) on a port, connects to 
> it and then sets the state of ANA Group #1 to "change", the issue described 
> at the beginning can be reproduced without any trouble on many major 
> distros and even on upstream code,

This state is not a permanent state, it is transient by definition,
which is why the host is treating it as such.

The host is expecting the controller to send another ANA AEN that
notifies it of the new state within ANATT (i.e. stateA -> change -> stateB).
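
To make that sequence concrete, here is a minimal Python model of the
host side. It is only a sketch, not the kernel code: the names HostCtrl,
AnaGroupDesc and read_ana_log are made up for illustration, and the timer
is reduced to a single deadline of one ANATT (the real driver may arm it
differently). It does mirror the behavior described above: a group in
"change" is counted even when it has no namespaces.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class AnaState(Enum):
    # ANA state encodings as defined by the NVMe specification.
    OPTIMIZED = 0x01
    NONOPTIMIZED = 0x02
    INACCESSIBLE = 0x03
    PERSISTENT_LOSS = 0x04
    CHANGE = 0x0F


@dataclass
class AnaGroupDesc:
    """One ANA group descriptor from the ANA log page (hypothetical model)."""
    grpid: int
    state: AnaState
    nsids: List[int] = field(default_factory=list)  # may be empty


@dataclass
class HostCtrl:
    """Simplified host-side controller state (hypothetical model)."""
    anatt: float                      # ANA Transition Time, in seconds
    deadline: Optional[float] = None  # when the ANATT timer would fire

    def read_ana_log(self, groups: List[AnaGroupDesc], now: float) -> int:
        # Count every group in "change", whether or not it has namespaces.
        nr_change_groups = sum(1 for g in groups if g.state is AnaState.CHANGE)
        if nr_change_groups:
            # Some group is transitioning: (re)arm the timer.
            self.deadline = now + self.anatt
        else:
            # Nothing is transitioning: cancel the timer.
            self.deadline = None
        return nr_change_groups

    def timer_expired(self, now: float) -> bool:
        # On expiry the host would log "ANATT timeout" and reset the controller.
        return self.deadline is not None and now >= self.deadline
```

In this model an empty group left in "change" keeps the timer armed until
it expires, while a later log read that reports stateB cancels it.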

> though sometimes it can be mitigated by disabling "native 
> multipath" (setting /sys/module/nvme_core/parameters/multipath to N), but 
> sometimes that's not the case, which is why this issue is quite annoying 
> for my setup.

That is simply removing support for multipathing altogether.

> I just checked it on 5.15.16 from Manjaro(basically Arch Linux) and 
> ELRepo's kernel-ml and kernel-lt(basically vanilla versions of the 
> mainline and LTS kernels respectively for CentOSs).
> 
> The standard says that:
> 
>  > An ANA Group may contain zero or more namespaces
> 
> which makes perfect sense, since one has to create a group prior to 
> assigning it to a namespace, and then:
> 
>  > While ANA Change state is reported by a controller for the namespace, 
> the host should: ...(part regarding ANATT)
> 
> So on one hand I think my setup might be questionable(I might allocate 
> ANAGRPID for "change" only in times of actual transitions, while that 
> might over-complicate usage of the module),

I still don't fully understand what you are trying to do, but creating
a transient ANA group for a change state sounds backwards to me.

> on the other I think it happens to be a misinterpretation of the 
> standard and might need some additional clarification.
> 
> That's why I decided to compose this message first prior to proposing 
> any patches.
> 
> Also, while digging through the code, I noticed that ANATT is at the 
> moment represented by an arbitrary constant (of 10 seconds), and since the 
> transition time often differs depending on the block devices in use 
> underneath the namespaces,
> it might be viable to allow the end user to change this value via configfs.

How would you expose it via configfs? ANA groups may be shared across
different ports IIRC. You would need to prevent conflicting settings...

> Considering everything I wrote, I'd like to hear opinions on the 
> following issues:
> 1. Is my utilization of ANA Groups a viable approach?

I don't think so, but I don't know if I understood what you are trying
to do.

> 2. Which ANA Group assignment schemes are utilized in production, in your 
> experience?

An ANA group will usually be used for what it is supposed to be: a group
of zero or more namespaces where each controller may have a different
access state to it (or to the namespaces assigned to it).
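
As a rough illustration of that model (a sketch only; the port names,
namespace IDs and the ns_state helper are hypothetical), namespaces stay
in their group and what differs per port is the state of the group:

```python
from typing import Dict

# Namespace ID -> ANA group ID. The membership is stable.
ns_to_group: Dict[int, int] = {1: 1, 2: 1, 3: 2}

# Per-port view: ANA group ID -> access state through that port.
port_group_state: Dict[str, Dict[int, str]] = {
    "portA": {1: "optimized", 2: "inaccessible"},
    "portB": {1: "inaccessible", 2: "optimized"},
}


def ns_state(port: str, nsid: int) -> str:
    """Access state of a namespace as seen through a given port."""
    return port_group_state[port][ns_to_group[nsid]]


def fail_over(group: int, src: str, dst: str) -> None:
    """Move the optimized path of a whole group from one port to another."""
    port_group_state[src][group] = "inaccessible"
    port_group_state[dst][group] = "optimized"
```

A failover then changes the state of a group on a port, and every
namespace in that group follows, instead of moving namespaces between
groups.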

> 3. Should changing the ANATT value be allowed via configfs (in 
> particular, on a per-subsystem level I think)?

Could be... We'll need to see patches.


