nvme-cli connect regression
Caleb Sander Mateos
csander at purestorage.com
Tue Apr 1 14:40:54 PDT 2025
On Tue, Apr 1, 2025 at 8:25 AM Daniel Wagner <dwagner at suse.de> wrote:
>
> Hi,
>
> Luca reported "occasional failures in the systemd integration test
> that uses nvme-cli" [1]:
>
> [ 10.378713] TEST-84-STORAGETM.sh[316]: + nvme connect-all -t tcp -a 127.0.0.1 -s 16858 --hostid=95fe8041-3f53-415b-bc40-1bbd8932e7e8
> [ 10.397892] nvme_tcp: queue 0: failed to receive icresp, error -4
> [ 10.397326] TEST-84-STORAGETM.sh[340]: failed to add controller, error failed to write to nvme-fabrics device
>
> I was not able to identify any changes in nvme-cli v2.12 which could
> explain this problem. The kernel in question is a stable kernel (6.13.7)
> which got the following commit backported:
>
> 578539e09690 ("nvme-tcp: fix connect failure on receiving partial ICResp
> PDU").
>
> The change itself looks okay, but I think it introduces a behavior
> change for the initial connect attempt.
>
> The error code is EINTR. Should the kernel retry here, or is userland
> in charge of retrying? Assuming we should retry... Thoughts?
>
> Thanks,
> Daniel
I am not sure what an error code of EINTR means in the kernel. It
looks like the kernel_recvmsg() is performed on the thread that writes
to /dev/nvme-fabrics, so it's possible that a signal was pending for
the userspace thread (i.e. the nvme-cli process)?
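
If userland is the one expected to retry, a minimal sketch of what
that could look like on the nvme-cli side (the helper name and the
connect-arguments string are purely illustrative, not the actual
nvme-cli code):

#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write the fabrics connect arguments and
 * retry the write if it is interrupted by a signal (EINTR). */
static int fabrics_connect(const char *args)
{
        int fd = open("/dev/nvme-fabrics", O_RDWR);
        ssize_t ret;

        if (fd < 0)
                return -errno;
        do {
                ret = write(fd, args, strlen(args));
        } while (ret < 0 && errno == EINTR);
        /* nvme-cli also reads the created controller's instance info
         * back from the same fd; omitted here. */
        close(fd);
        return ret < 0 ? -errno : 0;
}

/* e.g. fabrics_connect("transport=tcp,traddr=127.0.0.1,trsvcid=16858,...") */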
The kernel_recvmsg() should block until it receives all 128 bytes of
the ICResp PDU. It can also time out if the data is not received
within sk_rcvtimeo (10 * HZ), but I would expect it to return EAGAIN
in that case. I was going to say it doesn't look like 10 seconds had
passed yet in the quoted logs, but they also show time going
backwards, so I am not sure what to make of the timestamps...
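
For reference, here is my mental model of the receive path after that
commit (a sketch, not the verbatim kernel code):

#include <linux/net.h>
#include <linux/nvme-tcp.h>

/* Sketch of the ICResp receive in nvme_tcp_init_connection(). */
static int recv_icresp(struct socket *sock,
                       struct nvme_tcp_icresp_pdu *icresp)
{
        struct msghdr msg = { .msg_flags = MSG_WAITALL };
        struct kvec iov = {
                .iov_base = icresp,
                .iov_len  = sizeof(*icresp),    /* all 128 bytes */
        };
        int ret;

        /* MSG_WAITALL blocks until the full PDU arrives, bounded by
         * sk_rcvtimeo (10 * HZ).  A signal pending on the calling
         * thread (the one that wrote to /dev/nvme-fabrics) can
         * interrupt the wait, which would explain the error -4
         * (-EINTR) in the quoted log. */
        ret = kernel_recvmsg(sock, &msg, &iov, 1, iov.iov_len,
                             msg.msg_flags);
        if (ret >= 0 && ret < sizeof(*icresp))
                return -ECONNRESET;     /* short read despite WAITALL */
        return ret;
}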
Best,
Caleb