[PATCH] nvme: don't set a virt_boundary unless needed
Sagi Grimberg
sagi at grimberg.me
Thu Dec 21 04:32:35 PST 2023
On 12/21/23 14:17, Christoph Hellwig wrote:
> On Thu, Dec 21, 2023 at 11:30:38AM +0200, Sagi Grimberg wrote:
>>
>>> NVMe PRPs are a pain and force the expensive virt_boundary checking on
>>> block layer, prevent secure passthrough and require scatter/gather I/O
>>> to be split into multiple commands which is problematic for the upcoming
>>> atomic write support.
>>
>> But is the threshold still correct? meaning for I/Os small enough the
>> device will have lower performance? I'm not advocating that we keep it,
>> but we should at least mention the tradeoff in the change log.
>
> Chaitanya benchmarked it on the first generation of devices that
> supported SGLs. On the only SGL-enabled device I have there is no
> performance penalty for using SGLs on small transfers, but I'd love
> to see numbers from other setups.
Someone who cares should speak up, I assume.
Still, the changelog should indicate that there is no SGL/PRP threshold
anymore.
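For context, the per-I/O heuristic that goes away looks roughly like
this (paraphrased from nvme-pci, not the exact upstream code; use_sgls
is just a stand-in name here):
--
static bool use_sgls(struct nvme_dev *dev, struct request *req, int nseg)
{
	unsigned int avg_seg_size = blk_rq_payload_bytes(req) / nseg;

	/*
	 * Stay on PRPs for small average segments, otherwise use SGLs
	 * if the controller supports them (sgl_threshold is the
	 * existing module parameter).
	 */
	return nvme_ctrl_sgl_supported(&dev->ctrl) &&
	       sgl_threshold && avg_seg_size >= sgl_threshold;
}
--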
>>> For nvme-rdma the virt boundary is always required, as RDMA MRs are just
>>> as dumb as NVMe PRPs.
>>
>> That is actually device dependent. The driver can ask for a pool of
>> MRs with type IB_MR_TYPE_SG_GAPS if the device supports IBK_SG_GAPS_REG.
>>
>> See from ib_srp.c:
>> --
>> if (device->attrs.kernel_cap_flags & IBK_SG_GAPS_REG)
>> 	mr_type = IB_MR_TYPE_SG_GAPS;
>> else
>> 	mr_type = IB_MR_TYPE_MEM_REG;
>> --
>
> For that we'd need to support IB_MR_TYPE_SG_GAPS first, which can
> be done as an incremental improvement.
You actually don't need anything else, AFAIR; you just add this code and
stop setting the virt boundary.
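Untested sketch of what I mean (ibdev/ctrl are placeholders for the
nvme-rdma objects, and the NVME_CTRL_VIRT_BOUNDARY_* flags are the ones
added by this patch):
--
	enum ib_mr_type mr_type = IB_MR_TYPE_MEM_REG;

	/* Mirror the ib_srp.c selection quoted above. */
	if (ibdev->attrs.kernel_cap_flags & IBK_SG_GAPS_REG)
		mr_type = IB_MR_TYPE_SG_GAPS;

	/*
	 * Only keep the virt boundary for devices that cannot register
	 * gappy SG lists.
	 */
	if (mr_type != IB_MR_TYPE_SG_GAPS) {
		set_bit(NVME_CTRL_VIRT_BOUNDARY_IO, &ctrl->ctrl.flags);
		set_bit(NVME_CTRL_VIRT_BOUNDARY_ADMIN, &ctrl->ctrl.flags);
	}
--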
>>> + /*
>>> + * nvme-apple always uses PRPs and thus needs to set a virt boundary.
>>> + */
>>> + set_bit(NVME_CTRL_VIRT_BOUNDARY_IO, &anv->ctrl.flags);
>>> + set_bit(NVME_CTRL_VIRT_BOUNDARY_ADMIN, &anv->ctrl.flags);
>>> +
>>
>> Why two flags? Why can't the core just always set the blk virt boundary
>> on the admin request queue?
>
> It could, and given that the admin queue isn't performance critical it
> probably won't hurt in reality. But why enforce a really weird limit
> on the queue if there is no reason for it? The only transport that
> treats the admin queue different is PCIe, and that's just a spec
> oddity.
Exactly because it's odd. Unless there is some benefit to using SGLs for
admin commands, let's not flag it per transport.
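If we go that way it is a single per-controller flag, and the core
simply applies the boundary to the admin queue too; a rough sketch (the
flag name here is hypothetical, blk_queue_virt_boundary() is the
existing block layer helper):
--
	/*
	 * One flag per controller; admin I/O is not performance
	 * critical, so applying the boundary there costs nothing.
	 */
	if (test_bit(NVME_CTRL_VIRT_BOUNDARY, &ctrl->flags))
		blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
--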