nvme-cli connect regression

Tue Apr 15 14:26:40 PDT 2025

On Tue, Apr 15, 2025 at 2:14 PM Sagi Grimberg <sagi at grimberg.me> wrote:
>
>
>
> On 15/04/2025 12:01, Daniel Wagner wrote:
> > On Mon, Apr 14, 2025 at 01:29:42AM +0300, Sagi Grimberg wrote:
> >> Was this resolved?
> > I've added a retry loop in libnvme when the write to /dev/nvme-fabrics
> > returns EINTR. It takes a few days until the new version hits the test
> > frameworks and we know for sure that it is handled.
> >
> >> Couldn't follow where the issue was - kernel/userspace?
> >    [ 10.397892] nvme_tcp: queue 0: failed to receive icresp, error -4
> >
> > Stable kernels started to return EINTR and one change which touches
> > this area is:
> >
> >    578539e09690 ("nvme-tcp: fix connect failure on receiving partial ICResp PDU")
> >
> > Unfortunately, it's not easy for Luca to build a test kernel. If we
> > provide one for his distro, he would be willing to test it.
>
> It's not great that we now get sporadic EINTR errors. I am wondering what
> is triggering this?
>
> Caleb, did you see this?

Yes, I did see it. I don't understand where EINTR could come from
aside from the userspace process receiving a signal while waiting for
the ICResp (or hitting the 10 second timeout waiting to receive the
PDU). I don't understand how the switch to MSG_WAITALL would affect
this. A tcpdump might be helpful to understand whether the controller
is actually sending the full ICResp in a timely manner.
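
For reference, a userspace retry of the kind Daniel describes might look
roughly like the sketch below. This is not the actual libnvme change: the
helper name write_retry_eintr and the max_retries cap are hypothetical, and
it assumes that simply reissuing the write (and so the connect attempt) after
EINTR is acceptable.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper, not the libnvme implementation: reissue write()
 * when it fails with EINTR, up to max_retries additional attempts.
 * Any other error, or a successful write, is returned to the caller as-is.
 */
static ssize_t write_retry_eintr(int fd, const void *buf, size_t len,
                                 int max_retries)
{
    ssize_t ret;
    int tries = 0;

    assert(buf != NULL);

    do {
        ret = write(fd, buf, len);
    } while (ret < 0 && errno == EINTR && tries++ < max_retries);

    return ret;
}
```

A retry like this only hides the symptom in userspace; it does not explain
why the kernel side started returning EINTR in the first place.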

Best,
Caleb