[RFC PATCH V2 2/2] nvme: rdma: use ib_device's max_qp_wr to limit sqsize

Max Gurtovoy mgurtovoy at nvidia.com
Sat Dec 23 17:37:02 PST 2023



On 22/12/2023 8:58, Guixin Liu wrote:
> 
> On 2023/12/21 03:27, Sagi Grimberg wrote:
>>
>>>>> @@ -1030,11 +1030,13 @@ static int nvme_rdma_setup_ctrl(struct 
>>>>> nvme_rdma_ctrl *ctrl, bool new)
>>>>>               ctrl->ctrl.opts->queue_size, ctrl->ctrl.sqsize + 1);
>>>>>       }
>>>>> -    if (ctrl->ctrl.sqsize + 1 > NVME_RDMA_MAX_QUEUE_SIZE) {
>>>>> +    ib_max_qsize = ctrl->device->dev->attrs.max_qp_wr /
>>>>> +            (NVME_RDMA_SEND_WR_FACTOR + 1);
>>>>
>>>> rdma_dev_max_qsize is a better name.
>>>>
>>>> Also, you can drop the RFC for the next submission.
>>>>
>>>
>>> Sagi,
>>> I don't feel comfortable with these patches.
>>
>> Well, good that you're speaking up then ;)
>>
>>> First I would like to understand the need for it.
>>
>> I assumed that he stumbled on a device that did not support the
>> existing max of 128 nvme commands (which is 384 rdma wrs for the qp).
>>
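To spell out the arithmetic behind the 384 number: here is a rough sketch
(not the in-tree code), assuming each command costs 3 send WRs (MR
registration, SEND, local invalidate) when sizing the QP send queue:

	/*
	 * Sketch only: where the 384 comes from, assuming a 3x per-command
	 * send WR factor (MR, SEND, INV) when sizing the QP send queue.
	 */
	#define SEND_WR_FACTOR	3	/* MR, SEND, INV */

	static inline int sketch_send_wrs(int queue_size)
	{
		return SEND_WR_FACTOR * queue_size;	/* 128 commands -> 384 WRs */
	}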
> The situation is that I need a queue depth greater than 128.
>>> Second, a QP WR can be constructed from one or more WQEs, and a WQE 
>>> can be constructed from one or more WQEBBs. max_qp_wr doesn't take 
>>> this into account.
>>
>> Well, it is not taken into account now either with the existing magic
>> limit in nvmet. The rdma limits reporting mechanism was and still is
>> unusable.
>>
>> I would expect a device that has different sizes for different work
>> items to report a max_qp_wr that accounts for the largest work element
>> the device supports, so that the value is universally correct.
>>
>> The fact that max_qp_wr means the maximum number of slots in a qp, while
>> at the same time different work requests can arbitrarily use any number
>> of slots without anyone ever knowing, makes it pretty much impossible to
>> use reliably.
>>
>> Maybe rdma device attributes need a new attribute called
>> universal_max_qp_wr that is going to actually be reliable and not
>> guess-work?
> 
> I see, max_qp_wr is not as reliable as I imagined. Is there another way 
> to get a queue depth greater than 128 instead of changing 
> NVME_RDMA_MAX_QUEUE_SIZE?
> 

When I added this limit to the RDMA transports, the goal was to avoid a 
situation where QP creation fails because a large queue was requested.

I chose 128 since it was supported by all the RDMA adapters I've tested 
in my lab (mostly Mellanox adapters).
At this queue depth we found that the performance is good enough, and it 
did not improve when we increased the depth further.

Are you saying that you have a device that can provide better 
performance with a qdepth > 128?
What qdepth did you test, and what numbers do you see with it?
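
For completeness, the kind of clamp being discussed would look roughly
like the sketch below. This is an illustration only, assuming a
per-command cost of NVME_RDMA_SEND_WR_FACTOR send WRs plus one extra slot
as in the quoted diff; the helper name is made up, and whether max_qp_wr
can be trusted for this is exactly the open question above.

	#include <rdma/ib_verbs.h>

	/* assumed value: MR, SEND, INV per command */
	#define NVME_RDMA_SEND_WR_FACTOR	3

	/*
	 * Hypothetical helper: derive an upper bound for the NVMe queue size
	 * from the device's reported max_qp_wr, mirroring the ib_max_qsize
	 * computation in the quoted diff. Not a statement that max_qp_wr is
	 * reliable enough for this.
	 */
	static u32 nvme_rdma_dev_max_qsize(struct ib_device *ibdev)
	{
		return ibdev->attrs.max_qp_wr / (NVME_RDMA_SEND_WR_FACTOR + 1);
	}

In the quoted hunk this bound would take over the role that
NVME_RDMA_MAX_QUEUE_SIZE plays today in capping ctrl->ctrl.sqsize + 1.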


