NVMe over Fabrics host: behavior on presence of ANA Group in "change" state
Alex Talker
alextalker at yandex.ru
Sun Feb 6 05:59:44 PST 2022
Recently I noticed a peculiar error after connecting from the host
(CentOS 8 Stream at the time, more on that below)
via TCP (though the transport unlikely matters) to an NVMe target
subsystem exported using the nvmet module:
> ...
> nvme nvme1: ANATT timeout, resetting controller.
> nvme nvme1: creating 8 I/O queues.
> nvme nvme1: mapped 8/0/0 default/read/poll queues.
> ...
> nvme nvme1: ANATT timeout, resetting controller.
> ...
(and it continues like that over and over again; on some configurations
it even gets worse with each reconnect iteration)
I discovered that this behavior is caused by code in
drivers/nvme/host/multipath.c: in particular, the function
nvme_update_ana_state increments the nr_change_groups counter whenever
any ANA Group is in the "change" state, regardless of whether any
namespace belongs to that group or not.
Now, after figuring out that ANATT stands for ANA Transition Time and
reading some more of the NVMe 2.0 standard, I understood that the
problem is caused by how I use ANA Groups.
As far as I remember, the nvmet module permits at most 128 ANA Groups,
while the maximum number of namespaces is 1024 (8 times more).
Thus, mapping one namespace to one ANA Group works only up to a point.
It is nice to have logically related namespaces belong to the same
ANA Group, and the final scheme of how namespaces map to ANA Groups is
often vendor-specific
(or rather lies in the decision domain of the end user of the target).
However, rather than changing the state of namespaces on a specific
port, for example for maintenance reasons,
I find it particularly useful to use ANA Groups to change the state of
a particular namespace, since it is more likely that its backing block
device enters an unusable state or takes part in some transition
process.
Thus, the simplest scheme for me is to create a few ANA Groups on each
port, one per possible ANA state, and change the ANA Group of a
namespace rather than changing the state of the group the namespace
currently belongs to.
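For illustration, this is roughly how that scheme looks through the
nvmet configfs (the port number, the subsystem NQN "testnqn" and the
namespace ID here are just placeholders):

  # One group per possible state on the port; group 1 exists by default.
  cd /sys/kernel/config/nvmet/ports/1/ana_groups
  mkdir 2 3 4
  echo optimized     > 1/ana_state
  echo non-optimized > 2/ana_state
  echo inaccessible  > 3/ana_state
  echo change        > 4/ana_state
  # Instead of flipping a group's state, a namespace is moved into
  # whichever group reflects its current condition:
  echo 3 > /sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/ana_grpid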
And here's the catch.
If one creates a subsystem (no namespaces needed) on a port, connects
to it and then sets the state of ANA Group #1 to "change", the issue
described at the beginning is reproduced on practically every major
distro and even on upstream code.
Sometimes it can be mitigated by disabling native multipath (setting
/sys/module/nvme_core/parameters/multipath to N), but sometimes even
that does not help, which is why this issue is quite annoying for my
setup.
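For reference, the target-side reproduction boils down to roughly the
following (TCP transport; the address, port number and NQN are
placeholders):

  # Minimal subsystem, no namespaces at all.
  mkdir /sys/kernel/config/nvmet/subsystems/testnqn
  echo 1 > /sys/kernel/config/nvmet/subsystems/testnqn/attr_allow_any_host
  # A TCP port with the subsystem exported through it.
  mkdir /sys/kernel/config/nvmet/ports/1
  echo tcp          > /sys/kernel/config/nvmet/ports/1/addr_trtype
  echo ipv4         > /sys/kernel/config/nvmet/ports/1/addr_adrfam
  echo 192.168.0.10 > /sys/kernel/config/nvmet/ports/1/addr_traddr
  echo 4420         > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
  ln -s /sys/kernel/config/nvmet/subsystems/testnqn \
        /sys/kernel/config/nvmet/ports/1/subsystems/testnqn
  # Connect from the host, then flip the default group into "change".
  nvme connect -t tcp -a 192.168.0.10 -s 4420 -n testnqn
  echo change > /sys/kernel/config/nvmet/ports/1/ana_groups/1/ana_state
  # From this point the host loops on "ANATT timeout, resetting controller".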
I just checked it on 5.15.16 from Manjaro (basically Arch Linux) and on
ELRepo's kernel-ml and kernel-lt (basically vanilla versions of the
mainline and LTS kernels, respectively, for CentOS).
The standard says that:
> An ANA Group may contain zero or more namespaces
which makes perfect sense, since one has to create a group prior to
assigning it to a namespace, and then:
> While ANA Change state is reported by a controller for the namespace,
the host should: ...(part regarding ANATT)
So on one hand, I think my setup might be questionable (I could
allocate an ANAGRPID for "change" only during actual transitions,
though that might over-complicate usage of the module);
on the other hand, I think this might be a misinterpretation of the
standard and might need some additional clarification.
That's why I decided to compose this message before proposing any
patches.
Also, while digging through the code, I noticed that ANATT is currently
reported as a hard-coded constant (10 seconds), and since the
transition time often differs depending on the block devices in use
underneath the namespaces,
it might be viable to let the end user change this value via configfs.
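Purely as an illustration of what I have in mind (this attribute does
not exist today; its name and placement are made up), per-subsystem
usage could look like:

  # Hypothetical attribute, not present in the current nvmet code.
  echo 30 > /sys/kernel/config/nvmet/subsystems/testnqn/attr_anatt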
Considering everything I wrote, I'd like to hear opinions on the
following questions:
1. Is my use of ANA Groups a viable approach?
2. Which ANA Group assignment schemes are used in production, in your
experience?
3. Should changing the ANATT value be allowed via configfs (in
particular, at the per-subsystem level, I think)?
Thanks for reading till the very end! I hope I didn't ramble too much;
I just wanted to lay out all of the details.
Best regards,
Alex