A question regarding "multiple SGL"

Sagi Grimberg sagi at grimberg.me
Thu Oct 27 02:02:20 PDT 2016

> Hi Robert,

Hey Robert, Christoph,

> please explain your use cases that isn't handled.  The one and only
> reason to set MSDBD to 1 is to make the code a lot simpler given that
> there is no real use case for supporting more.
> RDMA uses memory registrations to register large and possibly
> discontiguous data regions for a single rkey, aka single SGL descriptor
> in NVMe terms.  There would be two reasons to support multiple SGL
> descriptors:  a) to support a larger I/O size than supported by a single
> MR, or b) to support a data region format not mappable by a single
> MR.
> iSER only supports a single rkey (or stag in IETF terminology) and has
> been doing fine on a) and mostly fine on b).   There are a few possible
> data layouts not supported by the traditional IB/iWarp FR WRs, but the
> limit is in fact exactly the same as imposed by the NVMe PRPs used for
> PCIe NVMe devices, so the Linux block layer has support to not generate
> them.  Also with modern Mellanox IB/RoCE hardware we can actually
> register completely arbitrary SGLs.  iSER supports using this registration
> mode already with a trivial code addition, but for NVMe we didn't have a
> pressing need yet.

Good explanation :)

The IO transfer size is a bit more pressing on some devices (e.g.
cxgb3/4) where the number of pages per-MR can be indeed lower than
a reasonable transfer size (Steve can correct me if I'm wrong).

However, if there is a real demand for this we'll happily accept
patches :)

Just a note, having this feature in-place can bring unexpected behavior
depending on how we implement it:
- If we can use multiple MRs per IO (for multiple SGLs) we can either
prepare for the worst-case and allocate enough MRs to satisfy the
various IO patterns. This will be much heavier in terms of resource
allocation and can limit the scalability of the host driver.
- Or we can implement a shared MR pool with a reasonable number of MRs.
In this case each IO can consume one or more MRs on the expense of
other IOs. In this case we may need to requeue the IO later when we
have enough available MRs to satisfy the IO. This can yield some
unexpected performance gaps for some workloads.


More information about the Linux-nvme mailing list