[RFC PATCH V2 2/2] nvme: rdma: use ib_device's max_qp_wr to limit sqsize

Sagi Grimberg sagi at grimberg.me
Mon Dec 25 00:59:28 PST 2023


>>>>>>> @@ -1030,11 +1030,13 @@ static int nvme_rdma_setup_ctrl(struct 
>>>>>>> nvme_rdma_ctrl *ctrl, bool new)
>>>>>>>               ctrl->ctrl.opts->queue_size, ctrl->ctrl.sqsize + 1);
>>>>>>>       }
>>>>>>> -    if (ctrl->ctrl.sqsize + 1 > NVME_RDMA_MAX_QUEUE_SIZE) {
>>>>>>> +    ib_max_qsize = ctrl->device->dev->attrs.max_qp_wr /
>>>>>>> +            (NVME_RDMA_SEND_WR_FACTOR + 1);
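
For readers following along, here is a minimal sketch of the clamp the hunk
above is computing. It is illustrative only: attrs.max_qp_wr and
NVME_RDMA_SEND_WR_FACTOR are taken from the patch, while the helper name and
the warning are made up here.

/*
 * Sketch: bound the nvme queue size by what the device's QP can hold.
 * Each command may post up to NVME_RDMA_SEND_WR_FACTOR send WRs, plus
 * one extra slot, so divide the reported max_qp_wr accordingly.
 */
static u32 nvme_rdma_dev_max_qsize(struct nvme_rdma_ctrl *ctrl)
{
	struct ib_device *ibdev = ctrl->device->dev;

	return ibdev->attrs.max_qp_wr / (NVME_RDMA_SEND_WR_FACTOR + 1);
}

	/* e.g. in nvme_rdma_setup_ctrl(), after the opts checks: */
	if (ctrl->ctrl.sqsize + 1 > nvme_rdma_dev_max_qsize(ctrl)) {
		dev_warn(ctrl->ctrl.device,
			"sqsize %u > device limit %u, clamping\n",
			ctrl->ctrl.sqsize + 1, nvme_rdma_dev_max_qsize(ctrl));
		ctrl->ctrl.sqsize = nvme_rdma_dev_max_qsize(ctrl) - 1;
	}

With a send WR factor of 3, the existing 128-command maximum corresponds to
the 384 rdma WRs mentioned further down (128 * 3 = 384).
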
>>>>>>
>>>>>> rdma_dev_max_qsize is a better name.
>>>>>>
>>>>>> Also, you can drop the RFC for the next submission.
>>>>>>
>>>>>
>>>>> Sagi,
>>>>> I don't feel comfortable with these patches.
>>>>
>>>> Well, good that you're speaking up then ;)
>>>>
>>>>> First I would like to understand the need for it.
>>>>
>>>> I assumed that he stumbled on a device that did not support the
>>>> existing max of 128 nvme commands (which is 384 rdma wrs for the qp).
>>>>
>>> The situation is that I need a queue depth greater than 128.
>>>>> Second, a QP WR can be constructed from one or more WQEs, and a WQE
>>>>> can be constructed from one or more WQEBBs. max_qp_wr doesn't take
>>>>> that into account.
>>>>
>>>> Well, it is not taken into account now either with the existing magic
>>>> limit in nvmet. The rdma limits reporting mechanism was and still is
>>>> unusable.
>>>>
>>>> I would expect a device that has different sizes for different work
>>>> items to report max_qp_wr accounting for the largest work element that
>>>> the device supports, so that it is universally correct.
>>>>
>>>> The fact that max_qp_wr means the maximum number of slots in a qp,
>>>> while at the same time different work requests can arbitrarily use any
>>>> number of slots without anyone ever knowing, makes it pretty much
>>>> impossible to use reliably.
>>>>
>>>> Maybe rdma device attributes need a new attribute called
>>>> universal_max_qp_wr that is going to actually be reliable and not
>>>> guess-work?
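
To make the slot accounting concrete, a purely hypothetical worked example
(all names and numbers below are invented for illustration, not taken from
any real provider):

/*
 * Hypothetical: the device advertises 32768 WR slots, a plain send
 * consumes 1 WQEBB, but a large MR registration consumes 4. A ULP that
 * cannot see the per-WR cost must assume the worst case, which is what
 * a provider-reported universal_max_qp_wr would effectively bake in.
 */
#define HYP_MAX_QP_WR		32768	/* today's max_qp_wr */
#define HYP_WORST_WQEBBS_PER_WR	4	/* invisible to the ULP */

static inline u32 hyp_universal_max_qp_wr(void)
{
	/* 32768 / 4 = 8192 WRs guaranteed to fit regardless of WR type */
	return HYP_MAX_QP_WR / HYP_WORST_WQEBBS_PER_WR;
}
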
>>>
>>> I see, max_qp_wr is not as reliable as I imagined. Is there any other
>>> way to get a queue depth greater than 128 instead of changing
>>> NVME_RDMA_MAX_QUEUE_SIZE?
>>>
>>
>> When I added this limit to the RDMA transports it was to avoid a
>> situation where QP creation would fail if someone asked for a large
>> queue.
>>
>> I chose 128 since it was supported by all the RDMA adapters I tested in
>> my lab (mostly Mellanox adapters).
>> For this queue depth we found that the performance was good enough and
>> did not improve when we increased the depth.
>>
>> Are you saying that you have a device that can provide better 
>> performance with qdepth > 128 ?
>> What is the tested qdepth and what are the numbers you see with this 
>> qdepth ?
> 
> Yeah, you are right, the improvement is small (about 1~2%); I did this
> only for a better benchmark.

Well, it doesn't come for free, you are essentially doubling the queue
depth. I'm also assuming that you tested a single initiator and a
single queue?

> I still insist that using the capabilities of the RDMA device to
> determine the queue size is a better choice, but for now I have changed
> NVME_RDMA_MAX_QUEUE_SIZE to 256 for bidding.

Still doesn't change the fact that it's pure guess-work whether the
device supports it or not.

Are you even able to create that queue depth with DIF workloads?

Max, what is the maximum effective depth with DIF enabled?


