nvme-tcp: Queue deadlock (stuck PDU) on NVMe TCP host driver

Samuel Jones sjones at kalrayinc.com
Thu Dec 10 15:21:26 EST 2020


Hi everyone,

We are observing an issue when testing the Linux NVMe TCP host driver with a Kalray target, on write workloads above the inline_data_size threshold. Sometimes the NVMe TCP host driver never replies to an R2T sent by the target, causing the command to time out and the connection to be torn down. We have verified that the R2T PDU is received by the host and acknowledged to the target by the TCP stack, and that the R2T PDU is seen by nvme_tcp_handle_r2t(). However, the H2CData PDU created by nvme_tcp_handle_r2t() is never pushed back into the TCP stack. We are using Linux 5.8.17 on the host, but the relevant code paths have not changed significantly since then, as far as I can tell. The same tests using a different host stack (SPDK) work fine.

My understanding of the issue is as follows; I'd be grateful for any corrections or clarifications. Data reception and transmission are both handled by a single function, nvme_tcp_io_work(), which runs on a workqueue (a simplified sketch of the loop follows the list below). It can be scheduled from several sources:

1. RX Data on the TCP socket
2. New TX space on the TCP socket
3. nvme_tcp_queue_request(): the driver has a new PDU to send
4. nvme_tcp_io_work reschedules itself if it thinks there is more work to be done
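
For reference, the shape of the loop as I read it in our 5.8.17 tree is roughly the following (simplified and paraphrased from memory, error handling omitted, so details may differ from upstream):

        static void nvme_tcp_io_work(struct work_struct *w)
        {
                struct nvme_tcp_queue *queue =
                        container_of(w, struct nvme_tcp_queue, io_work);
                unsigned long deadline = jiffies + msecs_to_jiffies(1);

                do {
                        bool pending = false;
                        int result;

                        /* TX: skipped entirely if someone else holds send_mutex */
                        if (mutex_trylock(&queue->send_mutex)) {
                                result = nvme_tcp_try_send(queue);
                                mutex_unlock(&queue->send_mutex);
                                if (result > 0)
                                        pending = true;
                        }

                        /* RX: done under the socket lock */
                        result = nvme_tcp_try_recv(queue);
                        if (result > 0)
                                pending = true;

                        if (!pending)
                                return;
                } while (!time_after(jiffies, deadline));

                /* quota exhausted: reschedule ourselves (source 4 above) */
                queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
        }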

This means that nvme_tcp_io_work() can be called in parallel on different threads. To handle this, a mutex (send_mutex) protects the transmission routines, and the socket lock is taken for reception. However, threads do not block on send_mutex: they call mutex_trylock(), and if they fail to take the lock they simply skip the TX stage. I believe there is a race around the use of send_mutex which can lead to data in the send_list being ignored, potentially indefinitely. Consider the following:

Thread 0                                                         |  Thread 1
nvme_tcp_handle_r2t (from io_work)
---------------------------------------------------------------->
                                                                 nvme_tcp_queue_rq (from blk io)
                                                                 nvme_tcp_queue_request: adds entry to list
                                                                 takes send_mutex
                                                                 goes to sleep in kernel_sendpage
<----------------------------------------------------------------
gets queue->lock, adds entry into list
schedules workqueue
exit io_work
re-enters io_work
fails to get send_mutex
exits io_work without rescheduling
                                                                 sends one PDU from the list
                                                                 releases send_mutex
                                                                 exits without rescheduling

If this occurs, the H2CData PDU will not be sent until io_work is rescheduled. If we are lucky, that happens soon because of other activity on the socket (new responses from the target, for example). If we are not, the H2CData PDU is never sent and the queue is deadlocked. Eventually the block layer request timeout kicks in and tears down the queue.

I have not been able to prove formally that the above sequence is what we are observing. I have, however, tested rescheduling the workqueue when the mutex_trylock() call guarding nvme_tcp_try_send() fails, and this fixes the issue for us.
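
Concretely, the workaround we tested amounts to something like the following in the TX branch of nvme_tcp_io_work() (a rough sketch against our 5.8.17 tree, not a polished patch):

                if (mutex_trylock(&queue->send_mutex)) {
                        result = nvme_tcp_try_send(queue);
                        mutex_unlock(&queue->send_mutex);
                        if (result > 0)
                                pending = true;
                } else {
                        /*
                         * Someone else (e.g. queue_rq's direct send) holds
                         * send_mutex and may not notice what we just added
                         * to the send list, so make sure io_work runs again.
                         */
                        pending = true;
                }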

Furthermore, I have some doubts about the fact that nvme_tcp_queue_request() does not check the return code of nvme_tcp_try_send(). I assume the author reasoned that, since the list is known to be non-empty at that point, there is no point in checking the return value. However, it appears to me that nvme_tcp_try_send() can fail transiently: there is a code path that catches EAGAIN in the sub-functions and returns 0 in that case. Surely we need to check the return value of nvme_tcp_try_send() in nvme_tcp_queue_request() and reschedule the workqueue if ret != 1? What's more, nvme_tcp_io_work() only treats the queue as having pending work if ret > 0, so we still have a problem when nvme_tcp_try_send() turns EAGAIN into 0, as io_work won't reschedule itself either...
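
To make my point concrete, I'd expect something along these lines in the direct-send path of nvme_tcp_queue_request() (only a sketch of the idea; the variable names and the other conditions on that branch are from memory and may not match the actual code exactly):

        if (sync && empty && mutex_trylock(&queue->send_mutex)) {
                ret = nvme_tcp_try_send(queue);
                mutex_unlock(&queue->send_mutex);
                /*
                 * ret == 0 can mean EAGAIN was swallowed further down:
                 * our PDU is still on the list, so make sure io_work
                 * comes back for it rather than assuming it was sent.
                 */
                if (ret != 1)
                        queue_work_on(queue->io_cpu, nvme_tcp_wq,
                                      &queue->io_work);
        } else if (last) {
                queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
        }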

What do you think? I'd be grateful for any comments you may have on this.
Best Regards
Samuel Jones



