NVMe RDMA driver: CX4 send queue fills up when nvme queue depth is low
Samuel Jones
sjones at kalray.eu
Thu Mar 16 03:57:13 PDT 2017
Hi all,
I have a Mellanox ConnectX-4 that I am using as an NVMf initiator to
communicate with an NVMe device. I am running a 4.8.17 kernel, with the
vanilla drivers for the Mellanox card. When I reduce the IO queue size
exposed by my NVMe device to < 32, I get an error from the NVMe RDMA driver:
[ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
[ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12
Then everything locks up.
This is because the check in mlx5_wq_overflow() in
drivers/infiniband/hw/mlx5/qp.c fails, meaning the send queue is full
(the -12 in the log is -ENOMEM).
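For reference, the failing check has roughly this shape (a simplified
paraphrase of mlx5_wq_overflow, not the verbatim kernel source):

static int wq_would_overflow(unsigned int head, unsigned int tail,
			     unsigned int nreq, unsigned int max_post)
{
	/* Simplified paraphrase: the send queue counts as full once
	 * the posted-but-uncompleted WQEs plus the new request reach
	 * max_post, at which point mlx5_ib_post_send fails with
	 * -ENOMEM (-12, as in the log above). */
	unsigned int cur = head - tail;

	return cur + nreq >= max_post;
}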
The same setup works with a ConnectX-3 Pro card.
I believe the issue is related to the following snippet in the NVMe RDMA
code:
/*
* Unsignalled send completions are another giant disaster in the
* IB Verbs spec: If we don't regularly post signalled sends
* the send queue will fill up and only a QP reset will rescue us.
* Would have been way too obvious to handle this in hardware or
* at least the RDMA stack..
*
* This messy and racy code snippet is copied and pasted from the iSER
* initiator, and the magic '32' comes from there as well.
*
* Always signal the flushes. The magic request used for the flush
* sequencer is not allocated in our driver's tagset and it's
* triggered to be freed by blk_cleanup_queue(). So we need to
* always mark it as signaled to ensure that the "wr_cqe", which is
* embedded in the request's payload, is not freed when __ib_process_cq()
* calls wr_cqe->done().
*/
if ((++queue->sig_count % 32) == 0 || flush)
	wr.send_flags |= IB_SEND_SIGNALED;
The iSER initiator, as I understand it, sizes send queues with a
constant that is at least 512. The NVMe code, however, sizes send queues
based on the size of the NVMe queue: in nvme_rdma_create_qp:
init_attr.cap.max_send_wr = factor * queue->queue_size + 1;
This attribute is then used by the mlx4 and mlx5 drivers to size the
HW queue, taking various HW-related factors into account. With an NVMe
queue depth of 16, the mlx4 driver gives me a send queue of 111
elements; the mlx5 driver gives me one of 85.
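A quick back-of-envelope check of how this overflows (the factor of 3
comes from the send_wr_factor accounting discussed just below; 85 is
the mlx5 queue depth I observe):

#include <assert.h>

int main(void)
{
	int sig_interval = 32;     /* the magic '32' from the snippet above */
	int wrs_per_request = 3;   /* send_wr_factor: MR, SEND and INV      */
	int mlx5_queue_depth = 85; /* what mlx5 allocates for queue_size 16 */

	/* Up to 96 WRs can be outstanding before the first signalled
	 * completion is even generated, which already exceeds the
	 * 85-entry send queue. */
	assert(sig_interval * wrs_per_request > mlx5_queue_depth);
	return 0;
}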
The send_wr_factor used in the NVMe RDMA code accounts for the fact
that each request can push up to three work requests (MR, SEND and
INV) onto the send queue. Signalling only every 32nd send therefore
seems to imply either that we need a send queue of at least 32 * 3 =
96 entries, or that the signalling interval should not be a constant
32 but min(32, queue_depth).
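Concretely, the second option might look something like this (a sketch
of my suggestion only; nvme_rdma_sig_limit is a hypothetical helper,
not existing code):

/* Hypothetical: derive the signalling interval from the queue depth
 * instead of using the fixed 32, so that the number of unsignalled
 * WRs can never outgrow the send queue. */
static inline int nvme_rdma_sig_limit(struct nvme_rdma_queue *queue)
{
	return min(32, queue->queue_size);
}

	...
	if ((++queue->sig_count % nvme_rdma_sig_limit(queue)) == 0 || flush)
		wr.send_flags |= IB_SEND_SIGNALED;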
Does anyone have an opinion on this issue? I'd be grateful for any help.
Samuel Jones