[RFC 0/7] Introduce swiotlb throttling

Thu Aug 29 20:58:16 PDT 2024

From: Robin Murphy <robin.murphy at arm.com> Sent: Wednesday, August 28, 2024 12:50 PM
> 
> On 2024-08-28 2:03 pm, Petr Tesařík wrote:
> > On Wed, 28 Aug 2024 13:02:31 +0100
> > Robin Murphy <robin.murphy at arm.com> wrote:
> >
> >> On 2024-08-22 7:37 pm, mhkelley58 at gmail.com wrote:
> >>> From: Michael Kelley <mhklinux at outlook.com>
> >>>
> >>> Background
> >>> ==========
> >>> Linux device drivers may make DMA map/unmap calls in contexts that
> >>> cannot block, such as in an interrupt handler. Consequently, when a
> >>> DMA map call must use a bounce buffer, the allocation of swiotlb
> >>> memory must always succeed immediately. If swiotlb memory is
> >>> exhausted, the DMA map call cannot wait for memory to be released. The
> >>> call fails, which usually results in an I/O error.
> >>>
> >>> Bounce buffers are usually used infrequently for a few corner cases,
> >>> so the default swiotlb memory allocation of 64 MiB is more than
> >>> sufficient to avoid running out and causing errors. However, recently
> >>> introduced Confidential Computing (CoCo) VMs must use bounce buffers
> >>> for all DMA I/O because the VM's memory is encrypted. In CoCo VMs
> >>> a new heuristic allocates ~6% of the VM's memory, up to 1 GiB, for
> >>> swiotlb memory. This large allocation reduces the likelihood of a
> >>> spike in usage causing DMA map failures. Unfortunately for most
> >>> workloads, this insurance against spikes comes at the cost of
> >>> potentially "wasting" hundreds of MiB's of the VM's memory, as swiotlb
> >>> memory can't be used for other purposes.
> >>>
> >>> Approach
> >>> ========
> >>> The goal is to significantly reduce the amount of memory reserved as
> >>> swiotlb memory in CoCo VMs, while not unduly increasing the risk of
> >>> DMA map failures due to memory exhaustion.
> >>
> >> Isn't that fundamentally the same thing that SWIOTLB_DYNAMIC was already
> >> meant to address? Of course the implementation of that is still young
> >> and has plenty of scope to be made more effective, and some of the ideas
> >> here could very much help with that, but I'm struggling a little to see
> >> what's really beneficial about having a completely disjoint mechanism
> >> for sitting around doing nothing in the precise circumstances where it
> >> would seem most possible to allocate a transient buffer and get on with it.
> >
> > This question can be probably best answered by Michael, but let me give
> > my understanding of the differences. First the similarity: Yes, one
> > of the key new concepts is that swiotlb allocation may block, and I
> > introduced a similar attribute in one of my dynamic SWIOTLB patches; it
> > was later dropped, but dynamic SWIOTLB would still benefit from it.
> >
> > More importantly, dynamic SWIOTLB may deplete memory following an I/O
> > spike. I do have some ideas how memory could be returned back to the
> > allocator, but the code is not ready (unlike this patch series).
> > Moreover, it may still be a better idea to throttle the devices
> > instead, because returning DMA'able memory is not always cheap. In a
> > CoCo VM, this memory must be re-encrypted, and that requires a
> > hypercall that I'm told is expensive.
> 
> Sure, making a hypercall in order to progress is expensive relative to
> being able to progress without doing that, but waiting on a lock for an
> unbounded time in the hope that other drivers might release their DMA
> mappings soon represents a potentially unbounded expense, since it
> doesn't even carry any promise of progress at all 

FWIW, the implementation in this patch set guarantees forward
progress for throttled requests as long as drivers that use MAY_BLOCK
are well-behaved.

> - oops userspace just
> filled up SWIOTLB with a misguided dma-buf import and now the OS has
> livelocked on stalled I/O threads fighting to retry :(
> 
> As soon as we start tracking thresholds etc. then that should equally
> put us in the position to be able to manage the lifecycle of both
> dynamic and transient pools more effectively - larger allocations which
> can be reused by multiple mappings until the I/O load drops again could
> amortise that initial cost quite a bit.

I'm not understanding what you envision here. Could you elaborate?
With the current implementation of SWIOTLB_DYNAMIC, dynamic
pools are already allocated with size MAX_PAGE_ORDER (or smaller
if that size isn't available). That size really isn't big enough in CoCo
VMs with more than 16 vCPUs since we want to split the allocation
into per-CPU areas. To fix this, we would need to support swiotlb
pools that are stitched together from multiple contiguous physical
memory ranges. That probably could be done, but I don't see how
it's related to thresholds.
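
As a back-of-the-envelope check on the 16-vCPU figure, here is the
arithmetic under assumed values: 4 KiB pages, MAX_PAGE_ORDER of 10,
2 KiB swiotlb slots, and 128-slot segments. I believe these are the
common x86 defaults, but treat them as assumptions. The helper name
is made up for illustration.

```python
# Assumed constants (believed to be common x86 defaults; configs differ).
PAGE_SIZE = 4096        # bytes per page
MAX_PAGE_ORDER = 10     # largest buddy allocation is 2**10 pages
IO_TLB_SLOT = 2048      # bytes per swiotlb slot
IO_TLB_SEGSIZE = 128    # slots per segment


def segments_per_area(ncpus):
    """Whole segments each per-CPU area gets from one max-order pool."""
    pool_bytes = PAGE_SIZE << MAX_PAGE_ORDER        # 4 MiB
    segment_bytes = IO_TLB_SLOT * IO_TLB_SEGSIZE    # 256 KiB
    return (pool_bytes // segment_bytes) // ncpus


for ncpus in (8, 16, 32):
    print(ncpus, segments_per_area(ncpus))
```

With these numbers a single pool holds 16 segments, so beyond 16 CPUs
an even split leaves less than one whole segment per area.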

> 
> Furthermore I'm not entirely convinced that the rationale for throttling
> being beneficial is even all that sound. Serialising requests doesn't
> make them somehow use less memory, it just makes them use it...
> serially. If a single CPU is capable of queueing enough requests at once
> to fill the SWIOTLB, this is going to do absolutely nothing; if two CPUs
> are capable of queueing enough requests together to fill the SWIOTLB,
> making them take slightly longer to do so doesn't inherently mean
> anything more than reaching the same outcome more slowly.

I don't get your point. My intent with throttling is that it caps the
system-wide high-water mark for swiotlb memory usage, without
causing I/O errors due to DMA map failures. Without
SWIOTLB_DYNAMIC, the original boot-time allocation size is the limit
for swiotlb memory usage, and DMA map fails if the system-wide
high-water mark tries to rise above that limit. With SWIOTLB_DYNAMIC,
the current code continues to allocate additional system memory and
turn it into swiotlb memory, with no limit. There probably *should*
be a limit, even for SWIOTLB_DYNAMIC.

I've run "fio" loads with and without throttling as implemented in this
patch set. With neither SWIOTLB_DYNAMIC nor throttling, it's pretty
easy to reach the limit and get I/O errors due to DMA map failure. With
throttling and the same "fio" load, the usage high-water mark stays
near the throttling threshold, with no I/O errors. The limit should be
set large enough for a workload to operate below the throttling
threshold. But if the threshold is exceeded, throttling should prevent
the DMA map failures that would otherwise cause I/O errors.
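
To make the comparison concrete, here is a toy tick-based model. It
is not the algorithm in the patch set and not a reproduction of the
"fio" runs. The function name, the one-slot requests, and the fixed
hold time are all assumptions for illustration. Without a threshold,
requests beyond the limit fail. With a threshold, requests that would
exceed it simply wait.

```python
from collections import deque


def simulate(arrivals, limit, threshold=None, hold=3):
    """arrivals[t] is the number of one-slot map requests at tick t.

    Each mapping is held for `hold` ticks. Returns (high_water, failures).
    threshold=None models no throttling: requests over `limit` fail.
    With a threshold (> 0), requests that would exceed it wait instead,
    and `limit` is never reached.
    """
    active = deque()    # release ticks of the active mappings
    waiting = 0         # throttled requests carried to the next tick
    high_water = failures = 0
    cap = limit if threshold is None else threshold
    t = 0
    while t < len(arrivals) or waiting or active:
        # Release mappings whose hold time has elapsed.
        while active and active[0] <= t:
            active.popleft()
        pending = waiting + (arrivals[t] if t < len(arrivals) else 0)
        waiting = 0
        for _ in range(pending):
            if len(active) < cap:
                active.append(t + hold)
            elif threshold is None:
                failures += 1      # unthrottled: the DMA map call fails
            else:
                waiting += 1       # throttled: retry on a later tick
        high_water = max(high_water, len(active))
        t += 1
    return high_water, failures


burst = [10, 10, 0, 0]
print(simulate(burst, limit=8))                # (8, 12)
print(simulate(burst, limit=8, threshold=6))   # (6, 0)
```

In this model the burst takes longer to drain when throttled, but the
high-water mark stays at the threshold and nothing fails.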

My mental model here is somewhat like blk-mq tags. There's a fixed
number allocated with the storage controller. Block I/O requests must
get a tag, and if one isn't available, the requesting thread is pended
until one becomes available. The fixed number of tags is the limit, but
the requestor doesn't get an error if a tag isn't available -- it just
waits. The fixed number of tags necessarily imposes a kind of
resource limit on block I/O requests, rather than just always allocating
an additional tag if there's a request that can't get an existing tag. I think
the same model makes sense for swiotlb memory when the device
driver can support it.
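
A minimal user-space sketch of that tag-like model, purely as an
analogy: the class, its methods, and the may_block flag are
hypothetical names, and none of this is kernel code. A request that
may block waits for space, while one that may not block fails
immediately, as it does today.

```python
import threading


class BouncePool:
    """Toy fixed-size pool: blockable requests wait, others fail."""

    def __init__(self, total_slots):
        self.total = total_slots
        self.used = 0
        self.high_water = 0
        self.cond = threading.Condition()

    def alloc(self, nslots, may_block=False):
        if nslots > self.total:
            return False                # can never be satisfied
        with self.cond:
            while self.used + nslots > self.total:
                if not may_block:
                    return False        # non-blocking caller gets an error
                self.cond.wait()        # blocking caller waits for a free()
            self.used += nslots
            self.high_water = max(self.high_water, self.used)
            return True

    def free(self, nslots):
        with self.cond:
            self.used -= nslots
            self.cond.notify_all()      # wake waiters to recheck capacity
```

The pool size is the hard limit here, just as the tag count is for
blk-mq. The only difference between the two kinds of caller is what
happens when that limit is reached.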

> At worst, if a
> thread is blocked from polling for completion and releasing a bunch of
> mappings of already-finished descriptors because it's stuck on an unfair
> lock trying to get one last one submitted, then throttling has actively
> harmed the situation.

OK, yes, I can understand there might be an issue with a driver (like
NVMe) that supports polling. I'll look at that more closely and see
if there is.

> 
> AFAICS this is dependent on rather particular assumptions of driver
> behaviour in terms of DMA mapping patterns and interrupts, plus the
> overall I/O workload shape, and it's not clear to me how well that
> really generalises.
> 
> > In short, IIUC it is faster in a CoCo VM to delay some requests a bit
> > than to grow the swiotlb.
> 
> I'm not necessarily disputing that for the cases where the assumptions
> do hold, it's still more a question of why those two things should be
> separate and largely incompatible (I've only skimmed the patches here,
> but my impression is that it doesn't look like they'd play all that
> nicely together if both enabled).

As I've mulled over your comments the past day, I'm not sure the two
things really are incompatible or even overlapping. To me it seems like
SWIOTLB_DYNAMIC is about whether the swiotlb memory is pre-allocated
at boot time, or allocated as needed. But SWIOTLB_DYNAMIC currently
doesn't have a limit to how much it will allocate, and it probably should.
(Though SWIOTLB_DYNAMIC has a limit imposed upon it if there isn't
enough contiguous physical memory to grow the swiotlb pool.) Given a
limit, both the pre-allocate case and allocate-as-needed case have the
same question about what to do when the limit is reached. In both cases,
we're generally forced to set the limit pretty high, because DMA map
failures occur if you breach the limit. Throttling is about dealing with
the limit in a better way when permitted by the driver. That in turn
allows setting a tighter limit and not having to overprovision the
swiotlb memory.

> To me it would make far more sense for
> this to be a tuneable policy of a more holistic SWIOTLB_DYNAMIC itself,
> i.e. blockable calls can opportunistically wait for free space up to a
> well-defined timeout, but then also fall back to synchronously
> allocating a new pool in order to assure a definite outcome of success
> or system-is-dying-level failure.

Yes, I can see value in the kind of interaction you describe before any
limits are approached. But to me the primary question is dealing better
with the limit.

Michael 

> 
> Thanks,
> Robin.