[Bug Report] nvme connect deadlock in allocating tag

Sun Apr 28 05:38:34 PDT 2024

On 28/04/2024 13:25, kwb wrote:
>> On 28/04/2024 12:16, Wangbing Kuang wrote:
>>> "The error_recovery work should unquiesce the admin_q, which should fail
>>> fast all pending admin commands,
>>> so it is unclear to me how the connect process gets stuck."
>>> I think the reason is: the command can be unquiesce but the tag cannot be
>>> return until command success.
>> The error recovery also cancels all pending requests. See
>> nvme_cancel_admin_tagset
> nvme_cancel_admin_tagset can cancel requests before stop admin queue, but
> cannot cancel requests before next reconnect time.

the error recovery does quiesce + cancel_admin_taget + unquiesce, all 
following
admin I/O should fail immediately upon submission as the ctrl/queue is 
not live.

> The time line is:
> recover failed(we can reproduce by hang io for more time)
> -> reconnect delay
> -> multi nvme list issue(used up tagset)
> -> reconnect start(wait for tag when call nvme_enabel_ctrl and nvme_wait_ready)

failing all admin I/O should not be associated with the next reconnect, 
it happens
way before that, in the error recovery work. Hence it is still not clear 
to me how
you are seeing what you are seeing.

It is possible that 5.15 is missing something.

>
>
>>> "What is step (2) - make nvme io timeout to recover the connection?"
>>> I use spdk-nvmf-target for backend.  It is easy to set read/write
>>> nvmf-target io  hang and unhang.  So I just set the io hang for over 30
>>> seconds, then trigger linux-nvmf-host trigger io timeout event. then io
>>> timeout will trigger connection recover.
>>> by the way, I use multipath=0
>> Interesting, does this happen with multipath=Y ?
>> I didn't expect people to be using multipath=0 for fabrics in the past few
>> years.
> No certain, I did not test on multipath=Y.We choose multipath=0 cos less code and we need only one path
>
>>> "Is this reproducing with upstream nvme? or is this some distro kernel
>>> where this happens?"
>>> it is reproduced in a kernel based from v5.15, but I think this is common
>>> error.
>> It would be beneficial to verify this.
> ok, test need more time, but we can first verify it only in v5.15.

We should not be spending time debugging an issue that might have
been addressed in upstream. The first thing we should do is to understand
if this reproduces in upstream, if so fix it, if not identify the 
missing patch(es)
in 5.15