nvmet-tcp timeout issue

Engel, Amit Amit.Engel at Dell.com
Mon Oct 24 12:37:45 PDT 2022


Hello Sagi et al.

We are seeing a potential bug that leads to an nvmet-tcp timeout in some cases.

It appears that there are cases where kernel_sendpage() fails and nvmet_tcp_write_space() is never called.
This leads to a timeout due to a 'stuck' IO.

Based on the code, as long as SOCK_NOSPACE is set when kernel_sendpage() fails, the nvmet_tcp_write_space() callback is called
and io_work is scheduled (queue_work_on), so there is no issue.
But we found a case where kernel_sendpage() fails and the SOCK_NOSPACE bit is not set.
In this case, we believe that nvmet_tcp_write_space() is never called, which leads to an IO timeout.
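
To illustrate the suspected sequence, here is a minimal user-space model. It is not the kernel code: struct model_sock and all model_* functions are hypothetical names made up for this sketch, and the rule "the write-space callback runs only if the NOSPACE bit is set" is taken from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space model; these are not kernel types. */
struct model_sock {
	bool nospace;        /* stands in for the SOCK_NOSPACE bit */
	bool io_work_queued; /* stands in for io_work being scheduled */
};

/*
 * Models a send that fails because the send buffer is full.
 * set_nospace selects whether the failing path also marks the socket.
 */
static int model_send_fail(struct model_sock *s, bool set_nospace)
{
	s->io_work_queued = false; /* io_work stops and waits for a wakeup */
	if (set_nospace)
		s->nospace = true;
	return -EAGAIN;
}

/*
 * Models send buffer space being freed later on. Assumption from the
 * text above: the write-space callback only runs if nospace is set.
 */
static void model_space_freed(struct model_sock *s)
{
	if (!s->nospace)
		return; /* callback skipped, nothing requeues io_work */
	s->nospace = false;
	s->io_work_queued = true; /* what queue_work_on() would do */
}

/* Returns true if io_work gets rescheduled after a failed send. */
static bool model_io_resumes(bool set_nospace)
{
	struct model_sock s = { false, false };
	int ret = model_send_fail(&s, set_nospace);

	assert(ret == -EAGAIN);
	model_space_freed(&s);
	return s.io_work_queued;
}
```

In this model, model_io_resumes(true) reports that io_work is rescheduled, while model_io_resumes(false) leaves it idle, which corresponds to the 'stuck' IO we see.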

In addition, it appears that some scenarios lead to unnecessary CPU stress:
When nvmet_try_send_response() returns 0,
nvmet_tcp_try_send_one() returns 1, and so does nvmet_tcp_try_send().
This means that the 'pending' flag in nvmet_tcp_io_work() is set to 'true', which keeps io_work running and causes unnecessary CPU stress.
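
The return value propagation can be sketched with a small user-space model. This is a simplification, not the driver source: the model_* names and the budget of 4 are hypothetical, and the handling of negative values (-EAGAIN mapped to 0, other errors passed through) is our assumption about the surrounding code. Only the "0 becomes 1" step is the behaviour reported above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Models how the result of the response send is mapped by the
 * per-command send step. response_ret is the modelled return value
 * of the response send.
 */
static int model_try_send_one(int response_ret)
{
	if (response_ret < 0) {
		if (response_ret == -EAGAIN)
			return 0; /* assumption: socket full, stop for now */
		return response_ret; /* assumption: real error is passed up */
	}
	return 1; /* both 0 and positive results are counted as progress */
}

/* Models the batch loop: stop on error or when nothing could be sent. */
static int model_try_send(int response_ret, int budget)
{
	int ret = 0;

	assert(budget > 0);
	for (int i = 0; i < budget; i++) {
		ret = model_try_send_one(response_ret);
		if (ret <= 0)
			break;
	}
	return ret;
}

/* Models the 'pending' decision taken by io_work after the send step. */
static bool model_io_work_pending(int response_ret)
{
	return model_try_send(response_ret, 4) > 0;
}
```

In this model, a response send that returns 0 still yields pending == true, so the work item would keep looping, whereas -EAGAIN yields pending == false.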

We would be glad to hear your feedback on both of the above issues.

Thanks
Amit Engel



