[PATCH] nvme: don't set a virt_boundary unless needed
Sagi Grimberg
sagi at grimberg.me
Thu Dec 21 04:32:35 PST 2023
On 12/21/23 14:17, Christoph Hellwig wrote:
> On Thu, Dec 21, 2023 at 11:30:38AM +0200, Sagi Grimberg wrote:
>>
>>> NVMe PRPs are a pain and force the expensive virt_boundary checking on
>>> block layer, prevent secure passthrough and require scatter/gather I/O
>>> to be split into multiple commands which is problematic for the upcoming
>>> atomic write support.
>>
>> But is the threshold still correct? meaning for I/Os small enough the
>> device will have lower performance? I'm not advocating that we keep it,
>> but we should at least mention the tradeoff in the change log.
>
> Chaitanya benchmarked it on the first generation of devices that
> supported SGLs. On the only SGL-enabled device I have there is no
> performance penalty for using SGLs on small transfers, but I'd love
> to see numbers from other setups.
Someone who cares should speak up, I assume.
Still, the changelog should indicate that there is no SGL/PRP threshold
anymore.
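For context, the per-I/O heuristic that goes away looks roughly like
this (paraphrased from nvme-pci, not the exact upstream code; use_sgls
is just a stand-in name here):
--
static bool use_sgls(struct nvme_dev *dev, struct request *req, int nseg)
{
	unsigned int avg_seg_size = blk_rq_payload_bytes(req) / nseg;

	/*
	 * Stay on PRPs for small average segments, otherwise use SGLs
	 * if the controller supports them (sgl_threshold is the
	 * existing module parameter).
	 */
	return nvme_ctrl_sgl_supported(&dev->ctrl) &&
	       sgl_threshold && avg_seg_size >= sgl_threshold;
}
--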
>>> For nvme-rdma the virt boundary is always required, as RDMA MRs are just
>>> as dumb as NVMe PRPs.
>>
>> That is actually device dependent. The driver can ask for a pool of
>> MRs with type IB_MR_TYPE_SG_GAPS if the device supports IBK_SG_GAPS_REG.
>>
>> See from ib_srp.c:
>> --
>> if (device->attrs.kernel_cap_flags & IBK_SG_GAPS_REG)
>> 	mr_type = IB_MR_TYPE_SG_GAPS;
>> else
>> 	mr_type = IB_MR_TYPE_MEM_REG;
>> --
>
> For that we'd need to support IB_MR_TYPE_SG_GAPS first, which can
> be done as an incremental improvement.
You actually don't need anything else, AFAIR; you just add this code and
stop setting the virt boundary.
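Untested sketch of what I mean (ibdev/ctrl are placeholders for the
nvme-rdma objects, and the NVME_CTRL_VIRT_BOUNDARY_* flags are the ones
added by this patch):
--
	enum ib_mr_type mr_type = IB_MR_TYPE_MEM_REG;

	/* Mirror the ib_srp.c selection quoted above. */
	if (ibdev->attrs.kernel_cap_flags & IBK_SG_GAPS_REG)
		mr_type = IB_MR_TYPE_SG_GAPS;

	/*
	 * Only keep the virt boundary for devices that cannot register
	 * gappy SG lists.
	 */
	if (mr_type != IB_MR_TYPE_SG_GAPS) {
		set_bit(NVME_CTRL_VIRT_BOUNDARY_IO, &ctrl->ctrl.flags);
		set_bit(NVME_CTRL_VIRT_BOUNDARY_ADMIN, &ctrl->ctrl.flags);
	}
--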
>>> + /*
>>> + * nvme-apple always uses PRPs and thus needs to set a virt boundary.
>>> + */
>>> + set_bit(NVME_CTRL_VIRT_BOUNDARY_IO, &anv->ctrl.flags);
>>> + set_bit(NVME_CTRL_VIRT_BOUNDARY_ADMIN, &anv->ctrl.flags);
>>> +
>>
>> Why two flags? Why can't the core just always set the blk virt boundary
>> on the admin request queue?
>
> It could, and given that the admin queue isn't performance critical it
> probably won't hurt in reality. But why enforce a really weird limit
> on the queue if there is no reason for it? The only transport that
> treats the admin queue different is PCIe, and that's just a spec
> oddity.
Exactly because it's odd. Unless there is some benefit to using SGLs for
admin commands, let's not flag it per transport.
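If we go that way it is a single per-controller flag, and the core
simply applies the boundary to the admin queue too; a rough sketch (the
flag name here is hypothetical, blk_queue_virt_boundary() is the
existing block layer helper):
--
	/*
	 * One flag per controller; admin I/O is not performance
	 * critical, so applying the boundary there costs nothing.
	 */
	if (test_bit(NVME_CTRL_VIRT_BOUNDARY, &ctrl->flags))
		blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
--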