[PATCH v4] nvme-tcp: fix connect failure on receiving partial ICResp PDU
Caleb Sander
csander at purestorage.com
Mon Jan 27 09:38:53 PST 2025
On Sun, Jan 26, 2025 at 11:37 PM Hannes Reinecke <hare at suse.de> wrote:
>
> On 1/24/25 19:43, Caleb Sander Mateos wrote:
> > nvme_tcp_init_connection() attempts to receive an ICResp PDU but only
> > checks that the return value from recvmsg() is non-negative. If the
> > sender closes the TCP connection or sends fewer than 128 bytes, this
> > check will pass even though the full PDU wasn't received.
> >
> > Ensure the full ICResp PDU is received by checking that recvmsg()
> > returns the expected 128 bytes.
> >
> > Additionally set the MSG_WAITALL flag for recvmsg(), as a sender could
> > split the ICResp over multiple TCP frames. Without MSG_WAITALL,
> > recvmsg() could return prematurely with only part of the PDU.
> >
> > Signed-off-by: Caleb Sander Mateos <csander at purestorage.com>
> > Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
> > ---
> > v4: keep recvmsg() error return value
> > v3: fix return value to indicate error
> > v2: add Fixes tag
> >
> > drivers/nvme/host/tcp.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index e9ff6babc540..56679eb8c0d6 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -1446,15 +1446,18 @@ static int nvme_tcp_init_connection(struct nvme_tcp_queue *queue)
> > iov.iov_len = sizeof(*icresp);
> > if (nvme_tcp_queue_tls(queue)) {
> > msg.msg_control = cbuf;
> > msg.msg_controllen = sizeof(cbuf);
> > }
> > + msg.msg_flags = MSG_WAITALL;
> > ret = kernel_recvmsg(queue->sock, &msg, &iov, 1,
> > iov.iov_len, msg.msg_flags);
>
> But won't we have to wait for a TCP timeout now if the sender sends less
> than 128 bytes? With this patch we always wait for 128 bytes, and
> possibly wait for TCP timeout if not.
Yes, if the NVMe/TCP controller sends less than 128 bytes, we need to
wait for it to send the remainder of the ICResp PDU. That's just how
the NVMe/TCP protocol works. If we want to protect against
buggy/malicious controllers that don't send a full ICResp, we need a
timeout mechanism. That's the purpose of the existing
`queue->sock->sk->sk_rcvtimeo = 10 * HZ;` in nvme_tcp_alloc_queue().
Note that recvmsg() timing out was already possible in the original
code if the controller didn't send anything on the TCP connection
after accepting it.
> Testcase for this would be nice ...
>
> And I need to check if secure concatenation is affected here; with
> secure concatenation we need to peek at the first packet to check
> if it's an ICRESP or a TLS negotiation.
Are you saying that with secure concatenation we don't know in advance
whether the connection is using TLS between the TCP and NVMe/TCP
protocol layers? Wouldn't the host already need to know that when it
sent its ICReq PDU?
If TLS is being used, my assumption was that `struct kvec iov` will
receive the decrypted (NVMe/TCP protocol) bytes rather than the
encrypted (TLS protocol) bytes. If that's not the case, then I think
there would be another existing bug, as the code interprets the
received bytes as an ICResp PDU regardless of whether TLS is in use.
Thanks,
Caleb
More information about the Linux-nvme
mailing list