[RFC PATCH V2 2/2] nvme: rdma: use ib_device's max_qp_wr to limit sqsize

Guixin Liu kanie at linux.alibaba.com
Mon Dec 25 00:40:22 PST 2023


On 2023/12/24 09:37, Max Gurtovoy wrote:
>
>
> On 22/12/2023 8:58, Guixin Liu wrote:
>>
>> On 2023/12/21 03:27, Sagi Grimberg wrote:
>>>
>>>>>> @@ -1030,11 +1030,13 @@ static int nvme_rdma_setup_ctrl(struct 
>>>>>> nvme_rdma_ctrl *ctrl, bool new)
>>>>>>               ctrl->ctrl.opts->queue_size, ctrl->ctrl.sqsize + 1);
>>>>>>       }
>>>>>> -    if (ctrl->ctrl.sqsize + 1 > NVME_RDMA_MAX_QUEUE_SIZE) {
>>>>>> +    ib_max_qsize = ctrl->device->dev->attrs.max_qp_wr /
>>>>>> +            (NVME_RDMA_SEND_WR_FACTOR + 1);
>>>>>
>>>>> rdma_dev_max_qsize is a better name.
>>>>>
>>>>> Also, you can drop the RFC for the next submission.
>>>>>
>>>>
>>>> Sagi,
>>>> I don't feel comfortable with these patches.
>>>
>>> Well, good that you're speaking up then ;)
>>>
>>>> First I would like to understand the need for it.
>>>
>>> I assumed that he stumbled on a device that did not support the
>>> existing max of 128 nvme commands (which is 384 rdma wrs for the qp).
>>>
>> The situation is that I need a queue depth greater than 128.
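
A minimal sketch of the device-capability approach from the hunk quoted
above (the helper name and the clamping step are illustrative, not the
final patch):

/*
 * Sketch only: derive the largest usable NVMe queue size from the RDMA
 * device's advertised max_qp_wr, assuming each command may consume up to
 * (NVME_RDMA_SEND_WR_FACTOR + 1) work requests on the QP.
 */
static u32 nvme_rdma_dev_max_qsize(struct nvme_rdma_ctrl *ctrl)
{
	struct ib_device *ibdev = ctrl->device->dev;

	return ibdev->attrs.max_qp_wr / (NVME_RDMA_SEND_WR_FACTOR + 1);
}

nvme_rdma_setup_ctrl() would then clamp against that value instead of the
fixed NVME_RDMA_MAX_QUEUE_SIZE, e.g.:

	u32 rdma_dev_max_qsize = nvme_rdma_dev_max_qsize(ctrl);

	if (ctrl->ctrl.sqsize + 1 > rdma_dev_max_qsize) {
		dev_warn(ctrl->ctrl.device,
			 "sqsize %u > device limit %u, clamping\n",
			 ctrl->ctrl.sqsize + 1, rdma_dev_max_qsize);
		ctrl->ctrl.sqsize = rdma_dev_max_qsize - 1;
	}
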
>>>> Second, the QP WR can be constructed from one or more WQEs and the 
>>>> WQEs can be constructed from one or more WQEBBs. The max_qp_wr 
>>>> doesn't take it into account.
>>>
>>> Well, it is not taken into account now either with the existing magic
>>> limit in nvmet. The rdma limits reporting mechanism was and still is
>>> unusable.
>>>
>>> I would expect a device that has different sizes for different work
>>> items to report max_qp_wr accounting for the largest work element that
>>> the device supports, so it is universally correct.
>>>
>>> The fact that max_qp_wr means the maximum number of slots in a qp and
>>> at the same time different work requests can arbitrarily use any number
>>> of slots without anyone ever knowing, makes it pretty much 
>>> impossible to
>>> use reliably.
>>>
>>> Maybe rdma device attributes need a new attribute called
>>> universal_max_qp_wr that is going to actually be reliable and not
>>> guess-work?
>>
>> I see, the max_qp_wr is not as reliable as I imagined. Is there any
>> other way to get a queue depth greater than 128 instead of changing
>> NVME_RDMA_MAX_QUEUE_SIZE?
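
A new attribute along the lines Sagi suggests might look like the
following in struct ib_device_attr (purely illustrative, no such field
exists today):

	/*
	 * Hypothetical: the number of WRs guaranteed to fit on a QP
	 * regardless of how the device expands them into WQEs/WQEBBs,
	 * so ULPs can size queues without guess-work.
	 */
	int	universal_max_qp_wr;
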
>>
>
> When I added this limit to the RDMA transports it was to avoid a
> situation where QP creation fails if someone asks for a large queue.
>
> I chose 128 since it was supported by all the RDMA adapters I've
> tested in my lab (mostly Mellanox adapters).
> For this queue depth we found that the performance is good enough and
> does not improve if we increase the depth further.
>
> Are you saying that you have a device that can provide better 
> performance with qdepth > 128 ?
> What is the tested qdepth and what are the numbers you see with this 
> qdepth ?

Yeah, you are right, the improvement is small (about 1%~2%); I did this
only to get better benchmark numbers.

I still insist that using the RDMA device's capabilities to determine the
queue size is a better choice, but for now I will change
NVME_RDMA_MAX_QUEUE_SIZE to 256 for bidding.
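
For reference, that change is just a bump of the existing fixed limit
(256 is only the value proposed here, not a tested recommendation):

/*
 * Fallback sketch: keep a fixed transport limit, just raise it.  Whether
 * 256 is safe for every RDMA device is exactly the open question above.
 */
#define NVME_RDMA_MAX_QUEUE_SIZE	256
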

Best regards,

Guixin Liu




