[PATCH] nvme: don't set a virt_boundary unless needed
Sagi Grimberg
sagi at grimberg.me
Mon Dec 25 02:08:43 PST 2023
On 12/22/23 03:16, Max Gurtovoy wrote:
>
>
> On 21/12/2023 11:30, Sagi Grimberg wrote:
>>
>>> NVMe PRPs are a pain and force the expensive virt_boundary checking on
>>> block layer, prevent secure passthrough and require scatter/gather I/O
>>> to be split into multiple commands which is problematic for the upcoming
>>> atomic write support.
>>
>> But is the threshold still correct? Meaning, for I/Os small enough
>> will the device have lower performance? I'm not advocating that we
>> keep it, but we should at least mention the tradeoff in the change log.
>>
>>> Fix the NVMe core to require an opt-in from the drivers for it.
>>>
>>> For nvme-apple it is always required as the driver only supports PRPs.
>>>
>>> For nvme-pci when SGLs are supported we'll always use them for data I/O
>>> that would require a virt_boundary.
>>>
>>> For nvme-rdma the virt boundary is always required, as RDMA MRs are just
>>> as dumb as NVMe PRPs.
>>
>> That is actually device dependent. The driver can ask for a pool of
>> mrs with type IB_MR_TYPE_SG_GAPS if the device supports IBK_SG_GAPS_REG.
>>
>> See from ib_srp.c:
>> --
>> if (device->attrs.kernel_cap_flags & IBK_SG_GAPS_REG)
>> mr_type = IB_MR_TYPE_SG_GAPS;
>> else
>> mr_type = IB_MR_TYPE_MEM_REG;
>
> For now, I prefer not using the IB_MR_TYPE_SG_GAPS MR in NVMe/RDMA since
> in the case of virtually contiguous data buffers it is better to use
> IB_MR_TYPE_MEM_REG, which gives much better performance. This is the reason
> I didn't add IB_MR_TYPE_SG_GAPS MR support for NVMe/RDMA.
I see. I guess it is not *that* trivial then.
> I actually had a plan to re-write the IB_MR_TYPE_SG_GAPS MR logic (or
> create a new MR type) that would internally open 2 MRs, so if the IO is
> contiguous it uses the MTT/MEM_REG and if it isn't it uses the
> KLM/SG_GAPS.
> This is how we implemented the SIG_MR, but we still haven't done it for
> the IB_MR_TYPE_SG_GAPS MR.
Sounds like a reasonable option. But doesn't this mean that the
driver will need to scan the page scatterlist to determine which internal
mr to use? Even a fallback mechanism can be affected by a given
workload. Plus there is the cost of doubling the number of preallocated
mrs.
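
Just so we're talking about the same thing, here is roughly what I
imagine that per-I/O choice would look like (a sketch only, with
made-up names, not your POC code):

--
#include <linux/scatterlist.h>
#include <rdma/ib_verbs.h>

/* sketch, made-up names: one MR of each type preallocated per queue context */
struct queue_mrs {
	struct ib_mr *mem_reg_mr;	/* ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, ...) */
	struct ib_mr *sg_gaps_mr;	/* ib_alloc_mr(pd, IB_MR_TYPE_SG_GAPS, ...) */
};

/* per I/O: walk the dma-mapped sg list and pick the cheaper MR */
static struct ib_mr *pick_mr(struct queue_mrs *q, struct scatterlist *sgl,
			     int nents)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		/* any gap on a page boundary forces the SG_GAPS MR */
		if (i > 0 && !IS_ALIGNED(sg_dma_address(sg), PAGE_SIZE))
			return q->sg_gaps_mr;
		if (i < nents - 1 &&
		    !IS_ALIGNED(sg_dma_address(sg) + sg_dma_len(sg),
				PAGE_SIZE))
			return q->sg_gaps_mr;
	}
	return q->mem_reg_mr;
}
--

So an O(nents) walk before every ib_map_mr_sg() call, plus two
ib_alloc_mr() allocations per context instead of one.
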
> Actually, I think we should have the same logic in the NVMe PCI driver:
> if the IO can be delivered as PRPs then the driver will prepare the SQE
> with PRPs. Otherwise, the driver will prepare an SGL.
> I think that doing the check in the driver for each IO is not so bad, and
> devices will benefit from it. Usually HW devices like to work with
> contiguous buffers. If the buffers can't be mapped with PRPs, then the
> HW will work a bit harder and use SGLs (which is better than doing a bounce
> buffer in the block layer).
>
> I actually did a POC internally for NVMe/RDMA: I created both an sg_gaps
> ib_mr and a mem_reg ib_mr, checked the buffer mapping for each IO, and got
> a big benefit when the buffers were discontig (using the sg_gaps mr). Also,
> the contig buffer performance didn't degrade because of the buffer mapping
> check.
>
> I created a fio flag that on purpose sends discontig IOs for my testing.
>
> WDYT ?
Sounds possible. However, for rdma we probably want this to be transparent
to the ulp so that all consumers can get this benefit. Also, perhaps add
this logic to the rdma core so other drivers can use it as well
(although I don't know if any other rdma driver supports sg gaps
anyway).
If this proves to be a good approach, pci can do something similar.
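
For pci that would essentially be the same walk, only against
NVME_CTRL_PAGE_SIZE rather than the system page size, deciding between
the existing PRP and SGL mapping paths (again just a sketch, the helper
name is made up):

--
#include <linux/scatterlist.h>
/* NVME_CTRL_PAGE_SIZE comes from drivers/nvme/host/nvme.h */

/*
 * sketch, made-up name: true means the request can be mapped with PRPs,
 * false means fall back to an SGL (assuming nvme_ctrl_sgl_supported()
 * reports SGL support).
 */
static bool sg_is_prp_mappable(struct scatterlist *sgl, int nents)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		if (i > 0 &&
		    !IS_ALIGNED(sg_dma_address(sg), NVME_CTRL_PAGE_SIZE))
			return false;
		if (i < nents - 1 &&
		    !IS_ALIGNED(sg_dma_address(sg) + sg_dma_len(sg),
				NVME_CTRL_PAGE_SIZE))
			return false;
	}
	return true;
}
--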