[RFC 0/7] Introduce swiotlb throttling

Wed Aug 28 09:41:50 PDT 2024

On Wed, 28 Aug 2024 16:30:04 +0000
Michael Kelley <mhklinux at outlook.com> wrote:

> From: Petr Tesařík <petr at tesarici.cz> Sent: Wednesday, August 28, 2024 6:04 AM
> > 
> > On Wed, 28 Aug 2024 13:02:31 +0100
> > Robin Murphy <robin.murphy at arm.com> wrote:
> >   
> > > On 2024-08-22 7:37 pm, mhkelley58 at gmail.com wrote:  
> > > > From: Michael Kelley <mhklinux at outlook.com>
> > > >
> > > > Background
> > > > ==========
> > > > Linux device drivers may make DMA map/unmap calls in contexts that
> > > > cannot block, such as in an interrupt handler. Consequently, when a
> > > > DMA map call must use a bounce buffer, the allocation of swiotlb
> > > > memory must always succeed immediately. If swiotlb memory is
> > > > exhausted, the DMA map call cannot wait for memory to be released. The
> > > > call fails, which usually results in an I/O error.
> > > >
> > > > Bounce buffers are usually used infrequently for a few corner cases,
> > > > so the default swiotlb memory allocation of 64 MiB is more than
> > > > sufficient to avoid running out and causing errors. However, recently
> > > > introduced Confidential Computing (CoCo) VMs must use bounce buffers
> > > > for all DMA I/O because the VM's memory is encrypted. In CoCo VMs
> > > > a new heuristic allocates ~6% of the VM's memory, up to 1 GiB, for
> > > > swiotlb memory. This large allocation reduces the likelihood of a
> > > > spike in usage causing DMA map failures. Unfortunately for most
> > > > workloads, this insurance against spikes comes at the cost of
> > > > potentially "wasting" hundreds of MiB's of the VM's memory, as swiotlb
> > > > memory can't be used for other purposes.
> > > >
> > > > Approach
> > > > ========
> > > > The goal is to significantly reduce the amount of memory reserved as
> > > > swiotlb memory in CoCo VMs, while not unduly increasing the risk of
> > > > DMA map failures due to memory exhaustion.  
> > >
> > > Isn't that fundamentally the same thing that SWIOTLB_DYNAMIC was already
> > > meant to address? Of course the implementation of that is still young
> > > and has plenty of scope to be made more effective, and some of the ideas
> > > here could very much help with that, but I'm struggling a little to see
> > > what's really beneficial about having a completely disjoint mechanism
> > > for sitting around doing nothing in the precise circumstances where it
> > > would seem most possible to allocate a transient buffer and get on with it.  
> > 
> > This question can be probably best answered by Michael, but let me give
> > my understanding of the differences. First the similarity: Yes, one
> > of the key new concepts is that swiotlb allocation may block, and I
> > introduced a similar attribute in one of my dynamic SWIOTLB patches; it
> > was later dropped, but dynamic SWIOTLB would still benefit from it.
> > 
> > More importantly, dynamic SWIOTLB may deplete memory following an I/O
> > spike. I do have some ideas how memory could be returned back to the
> > allocator, but the code is not ready (unlike this patch series).
> > Moreover, it may still be a better idea to throttle the devices
> > instead, because returning DMA'able memory is not always cheap. In a
> > CoCo VM, this memory must be re-encrypted, and that requires a
> > hypercall that I'm told is expensive.
> > 
> > In short, IIUC it is faster in a CoCo VM to delay some requests a bit
> > than to grow the swiotlb.
> > 
> > Michael, please add your insights.
> > 
> > Petr T
> >   
> 
> The other limitation of SWIOTLB_DYNAMIC is that growing swiotlb
> memory requires large chunks of physically contiguous memory,
> which may be impossible to get after a system has been running a
> while. With a major rework of swiotlb memory allocation code, it might
> be possible to get by with a piecewise assembly of smaller contiguous
> memory chunks, but getting many smaller chunks could also be
> challenging.
> 
> Growing swiotlb memory also must be done as a background async
> operation if the DMA map operation can't block. So transient buffers
> are needed, which must be encrypted and decrypted on every round
> trip in a CoCo VM. The transient buffer memory comes from the
> atomic pool, which typically isn't that large and could itself become
> exhausted. So we're somewhat playing whack-a-mole on the memory
> allocation problem.

Note that this situation can be somewhat improved with the
SWIOTLB_ATTR_MAY_BLOCK flag, because a new SWIOTLB chunk can then be
allocated immediately, removing the need to allocate a transient pool
from the atomic pool.

> We discussed the limitations of SWIOTLB_DYNAMIC in large CoCo VMs
> at the time SWIOTLB_DYNAMIC was being developed, and I think there
> was general agreement that throttling would be better for the CoCo
> VM scenario.
> 
> Broadly, throttling DMA map requests seems like a fundamentally more
> robust approach than growing swiotlb memory. And starting down
> the path of allowing designated DMA map requests to block might have
> broader benefits as well, perhaps on the IOMMU path.
> 
> These points are all arguable, and your point about having two somewhat
> overlapping mechanisms is valid. Between the two, my personal viewpoint
> is that throttling is the better approach, but I'm probably biased by my
> background in the CoCo VM world. Petr and others may see the tradeoffs
> differently.

For CoCo VMs, throttling indeed seems to be better. Embedded devices
seem to benefit more from growing the swiotlb on demand.

As usual, YMMV.

Petr T