Request timeout seen with NVMEoF TCP

Sagi Grimberg sagi at grimberg.me
Mon Dec 14 15:13:13 EST 2020


> Hi Sagi,

Hey Sam,

> Without instrumenting the driver it's hard to say what might be happening here.

Yes, that would be the next step.

>   But I did make a few comments at the end of my initial email which 
> might be relevant:
> 
> 1. It seems abnormal to me that the direct send path does not check the return value of nvme_tcp_try_send(), given that this routine at least claims to be able to fail transiently. nvme_tcp_queue_request should reschedule the workqueue if nvme_tcp_try_send() does not return 1, IMHO.
> 
> 2. If, for whatever reason, kernel_sendpage or sock_no_sendpage returns -EAGAIN, then nvme_tcp_try_send() returns 0. This will be interpreted by nvme_tcp_io_work() as meaning there is nothing more to do. This is wrong, because in fact there is more work to do and nvme_tcp_io_work() should reschedule itself in this case. Unfortunately, as the system is coded, nvme_tcp_io_work() has no way of distinguishing between "nvme_tcp_try_send returned 0 because there is nothing to do" and "nvme_tcp_try_send returned 0 because of a transient failure".
> 
> Not sure how possible these cases are in practice but theoretically they can occur..

The design is that if sendpage fails with a transient error (e.g.
-EAGAIN), the socket buffer is full, and we are guaranteed that
nvme_tcp_write_space will be called once the socket makes additional
room. When write_space is called, we schedule io_work to resume the
send; hence the direct send path does not reschedule io_work in this
case.
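To make the contract concrete, here is a toy user-space model of that design (not the driver code; the toy_* names and the single-buffer bookkeeping are my own simplification):

```c
#include <errno.h>
#include <stdbool.h>

/* Toy model of the design described above: the direct send path stops on
 * -EAGAIN without rescheduling, because the write_space callback is
 * guaranteed to fire later and queue io_work to resume the send. */

struct toy_queue {
	int sock_space;      /* free bytes in the "socket buffer" */
	int pending;         /* bytes left to send for the request */
	bool io_work_queued; /* stands in for queue_work() on io_work */
};

/* Stand-in for kernel_sendpage(): sends what fits, or fails transiently
 * with -EAGAIN when the socket buffer is full. */
static int toy_sendpage(struct toy_queue *q, int len)
{
	int sent;

	if (q->sock_space == 0)
		return -EAGAIN;
	sent = len < q->sock_space ? len : q->sock_space;
	q->sock_space -= sent;
	return sent;
}

/* Direct send path: on -EAGAIN it simply returns; it does NOT requeue
 * io_work itself. */
static void toy_direct_send(struct toy_queue *q)
{
	while (q->pending) {
		int ret = toy_sendpage(q, q->pending);

		if (ret == -EAGAIN)
			return; /* wait for write_space */
		q->pending -= ret;
	}
}

/* Stand-in for nvme_tcp_write_space(): the socket gained room, so
 * schedule io_work to resume the stalled send. */
static void toy_write_space(struct toy_queue *q, int room)
{
	q->sock_space += room;
	q->io_work_queued = true;
}
```

Running a stalled send through this model shows the hand-off: the direct path leaves bytes pending without requeuing, and only the write_space callback re-arms io_work.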

It may be possible that write_space is called when the socket buffer has
only a little more room, and that room is immediately consumed by the
direct send path. But in that case io_work should still be triggered,
and with the proposed fix it should reschedule itself if it was unable
to acquire send_mutex.
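A minimal sketch of that fix, as I read it (again a hypothetical user-space model with pthreads standing in for the kernel mutex, not the actual patch):

```c
#include <pthread.h>
#include <stdbool.h>

/* Sketch of the proposed fix: if io_work cannot take send_mutex because
 * the direct send path holds it, io_work must requeue itself rather than
 * give up, so no queued data is stranded. */

struct toy_io_ctx {
	pthread_mutex_t send_mutex;
	bool rescheduled; /* stands in for queue_work() re-arming io_work */
	int sends_done;
};

static void toy_io_work(struct toy_io_ctx *c)
{
	if (pthread_mutex_trylock(&c->send_mutex) == 0) {
		c->sends_done++; /* nvme_tcp_try_send() would run here */
		pthread_mutex_unlock(&c->send_mutex);
	} else {
		/* Direct send path owns the mutex: reschedule ourselves. */
		c->rescheduled = true;
	}
}
```

The key point is the else branch: without it, a trylock failure would silently drop the work item, which is exactly the kind of lost wakeup that can surface as a request timeout.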

But apparently there is a different race here that we're not seeing
at the moment...


