[PATCH v2] nvme: fix reconnection fail due to reserved tag allocation

Sagi Grimberg sagi at grimberg.me
Thu Mar 7 03:18:50 PST 2024



On 07/03/2024 13:06, brookxu.cn wrote:
> From: Chunguang Xu <chunguang.xu at shopee.com>
>
> We found an issue in a production environment while using NVMe
> over RDMA: admin_q reconnection failed forever even though the
> remote target and the network were fine. After digging into it,
> we found it may be caused by an ABBA deadlock due to tag
> allocation. In my case, the tag was held by a keep-alive request
> waiting inside admin_q. Because we quiesce admin_q while
> resetting the ctrl, the request is marked as idle and will not
> be processed until the reset succeeds. As fabric_q shares its
> tagset with admin_q, reconnecting to the remote target requires
> a tag for the connect command, but the only reserved tag was
> held by the keep-alive command waiting inside admin_q. As a
> result, we failed to reconnect admin_q forever. To fix this
> issue, I think we should keep two reserved tags for the admin
> queue.
>
> Fixes: ed01fee283a0 ("nvme-fabrics: only reserve a single tag")
> Signed-off-by: Chunguang Xu <chunguang.xu at shopee.com>

Reviewed-by: Sagi Grimberg <sagi at grimberg.me>
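
For readers following the thread, the starvation described in the quoted
commit message can be sketched as a small user-space model. This is only an
illustration under stated assumptions, not kernel code: struct reserved_pool,
pool_init(), pool_get() and reconnect_can_proceed() are hypothetical names
invented here, and the model ignores everything except the count of reserved
tags shared by the admin and fabrics queues.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RESERVED 4

/* Hypothetical reserved-tag pool; a toy stand-in, not the blk-mq tagset. */
struct reserved_pool {
	int nr_reserved;
	bool in_use[MAX_RESERVED];
};

static void pool_init(struct reserved_pool *p, int nr_reserved)
{
	assert(nr_reserved >= 0 && nr_reserved <= MAX_RESERVED);
	p->nr_reserved = nr_reserved;
	for (int i = 0; i < MAX_RESERVED; i++)
		p->in_use[i] = false;
}

/* Non-blocking allocation: returns a tag index, or -1 if none is free. */
static int pool_get(struct reserved_pool *p)
{
	for (int i = 0; i < p->nr_reserved; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = true;
			return i;
		}
	}
	return -1;
}

/*
 * Models one reconnect attempt. Returns true if the connect command can
 * obtain a reserved tag while a keep-alive request is stuck in the
 * quiesced admin queue.
 */
static bool reconnect_can_proceed(int nr_reserved)
{
	struct reserved_pool pool;

	pool_init(&pool, nr_reserved);

	/*
	 * The keep-alive request takes a reserved tag, then the admin queue
	 * is quiesced for the reset. The request stays queued and its tag is
	 * never returned to the pool until the reconnect succeeds.
	 */
	(void)pool_get(&pool);

	/* The connect command draws from the same shared pool. */
	return pool_get(&pool) >= 0;
}
```

In this model, reconnect_can_proceed(1) is false because the stuck keep-alive
holds the only reserved tag, while reconnect_can_proceed(2) is true, which is
the effect the patch aims for by keeping two reserved tags.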


