[PATCH] nvme: don't set a virt_boundary unless needed

Max Gurtovoy mgurtovoy at nvidia.com
Thu Dec 21 17:16:28 PST 2023



On 21/12/2023 11:30, Sagi Grimberg wrote:
> 
>> NVMe PRPs are a pain and force the expensive virt_boundary checking on
>> the block layer, prevent secure passthrough, and require scatter/gather
>> I/O to be split into multiple commands, which is problematic for the
>> upcoming atomic write support.
> 
> But is the threshold still correct? Meaning, for small enough I/Os,
> will the device have lower performance? I'm not advocating that we keep
> it, but we should at least mention the tradeoff in the change log.
> 
>> Fix the NVMe core to require an opt-in from the drivers for it.
>>
>> For nvme-apple it is always required as the driver only supports PRPs.
>>
>> For nvme-pci when SGLs are supported we'll always use them for data I/O
>> that would require a virt_boundary.
>>
>> For nvme-rdma the virt boundary is always required, as RDMA MRs are
>> just as dumb as NVMe PRPs.
> 
> That is actually device dependent. The driver can ask for a pool of
> MRs with type IB_MR_TYPE_SG_GAPS if the device supports IBK_SG_GAPS_REG.
> 
> See from ib_srp.c:
> -- 
>         if (device->attrs.kernel_cap_flags & IBK_SG_GAPS_REG)
>                  mr_type = IB_MR_TYPE_SG_GAPS;
>          else
>                  mr_type = IB_MR_TYPE_MEM_REG;

For now, I prefer not to use the IB_MR_TYPE_SG_GAPS MR in NVMe/RDMA, 
since for virtually contiguous data buffers IB_MR_TYPE_MEM_REG gives 
much better performance. This is the reason I didn't add 
IB_MR_TYPE_SG_GAPS MR support to NVMe/RDMA.

I actually had a plan to rewrite the IB_MR_TYPE_SG_GAPS MR logic (or 
create a new MR type) so that it would internally open two MRs: if the 
I/O is contiguous it would use the MTT/MEM_REG path, and if it isn't it 
would use the KLM/SG_GAPS path. This is how we implemented the SIG_MR, 
but we never got around to doing it for the IB_MR_TYPE_SG_GAPS MR.

Actually, I think we should have the same logic in the NVMe PCI driver: 
if an I/O can be delivered as PRPs, the driver prepares the SQE with 
PRPs; otherwise it prepares an SGL. Doing the check in the driver for 
each I/O is not that expensive, and devices will benefit from it. HW 
devices usually prefer working with contiguous buffers. If the buffers 
can't be mapped with PRPs, the HW will work a bit harder and use SGLs, 
which is still better than bouncing the buffer in the block layer.
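To illustrate the per-I/O check I have in mind, here is a minimal
sketch, not what is in the tree: nvme_sgl_fits_prps() is a hypothetical
helper, and I open-code NVME_CTRL_PAGE_SIZE (it lives in
drivers/nvme/host/nvme.h) to keep the sketch self-contained:
--
#include <linux/scatterlist.h>

/* controller page size, open-coded here; see drivers/nvme/host/nvme.h */
#define NVME_CTRL_PAGE_SIZE	4096

/*
 * Return true if a DMA-mapped scatterlist can be expressed with PRPs:
 * every element except the first must start on a controller page
 * boundary, and every element except the last must end on one.
 */
static bool nvme_sgl_fits_prps(struct scatterlist *sgl, int nents)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		dma_addr_t addr = sg_dma_address(sg);
		unsigned int len = sg_dma_len(sg);

		if (i > 0 && (addr & (NVME_CTRL_PAGE_SIZE - 1)))
			return false;
		if (i < nents - 1 &&
		    ((addr + len) & (NVME_CTRL_PAGE_SIZE - 1)))
			return false;
	}
	return true;
}
--
The submission path would call this once per I/O after DMA mapping and
select the PRP or SGL SQE layout accordingly.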

I actually did a POC internally for NVMe/RDMA: I created both an 
sg_gaps ib_mr and a mem_reg ib_mr, checked the buffer mapping for each 
I/O, and got a big benefit when the buffers were discontiguous (using 
the sg_gaps MR). The performance for contiguous buffers also didn't 
degrade because of the buffer mapping check.
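The per-queue setup for that POC could look roughly like this. A
sketch, assuming one MR pair per request context; struct
nvme_rdma_mr_pair and nvme_rdma_alloc_mr_pair() are made-up names, only
the verbs calls are real:
--
#include <linux/err.h>
#include <rdma/ib_verbs.h>

struct nvme_rdma_mr_pair {
	struct ib_mr *mem_reg;	/* MTT-based, for contiguous payloads */
	struct ib_mr *sg_gaps;	/* KLM-based, for gapped payloads */
};

static int nvme_rdma_alloc_mr_pair(struct ib_pd *pd, u32 max_num_sg,
				   struct nvme_rdma_mr_pair *pair)
{
	pair->mem_reg = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, max_num_sg);
	if (IS_ERR(pair->mem_reg))
		return PTR_ERR(pair->mem_reg);

	pair->sg_gaps = ib_alloc_mr(pd, IB_MR_TYPE_SG_GAPS, max_num_sg);
	if (IS_ERR(pair->sg_gaps)) {
		ib_dereg_mr(pair->mem_reg);
		return PTR_ERR(pair->sg_gaps);
	}
	return 0;
}
--
Per I/O, the mapping code would then run the same contiguity check as
above and post the registration WR on whichever MR matches the payload.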

For my testing, I added a fio flag that sends discontiguous I/Os on 
purpose.

WDYT?


> -- 
> 
>>
>> For nvme-tcp and nvme-fc I set the flags for now because I don't
>> understand the drivers fully, but I suspect the flags could be lifted.
> 
> tcp can absolutely omit virt boundaries.
> 
>> For nvme-loop the flag is never set as it doesn't have any requirements
>> on the I/O format.
>>
>> Signed-off-by: Christoph Hellwig <hch at lst.de>
>> ---
>>   drivers/nvme/host/apple.c |  6 +++++
>>   drivers/nvme/host/core.c  | 11 ++++++++-
>>   drivers/nvme/host/fc.c    |  3 +++
>>   drivers/nvme/host/nvme.h  |  4 +++
>>   drivers/nvme/host/pci.c   | 52 ++++++++++++++++++++++-----------------
>>   drivers/nvme/host/rdma.c  |  6 +++++
>>   drivers/nvme/host/tcp.c   |  3 +++
>>   7 files changed, 61 insertions(+), 24 deletions(-)
>>
>> diff --git a/drivers/nvme/host/apple.c b/drivers/nvme/host/apple.c
>> index 596bb11eeba5a9..a1afb54e3b4da8 100644
>> --- a/drivers/nvme/host/apple.c
>> +++ b/drivers/nvme/host/apple.c
>> @@ -1116,6 +1116,12 @@ static void apple_nvme_reset_work(struct work_struct *work)
>>           goto out;
>>       }
>> +    /*
>> +     * nvme-apple always uses PRPs and thus needs to set a virt boundary.
>> +     */
>> +    set_bit(NVME_CTRL_VIRT_BOUNDARY_IO, &anv->ctrl.flags);
>> +    set_bit(NVME_CTRL_VIRT_BOUNDARY_ADMIN, &anv->ctrl.flags);
>> +
> 
> Why two flags? Why can't the core just always set the blk virt boundary
> on the admin request queue?

I also think that a single flag is good enough.
> 
>>       ret = nvme_init_ctrl_finish(&anv->ctrl, false);
>>       if (ret)
>>           goto out;
> 


