[PATCH v2 2/2] treewide: Add the __GFP_PACKED flag to several non-DMA kmalloc() allocations

Isaac Manjarres isaacmanjarres at google.com
Wed Nov 2 13:50:23 PDT 2022


On Wed, Nov 02, 2022 at 11:05:54AM +0000, Catalin Marinas wrote:
> On Tue, Nov 01, 2022 at 12:10:51PM -0700, Isaac Manjarres wrote:
> > On Tue, Nov 01, 2022 at 06:39:40PM +0100, Christoph Hellwig wrote:
> > > On Tue, Nov 01, 2022 at 05:32:14PM +0000, Catalin Marinas wrote:
> > > > There's also the case of low-end phones with all RAM below 4GB and arm64
> > > > doesn't allocate the swiotlb. Not sure those vendors would go with a
> > > > recent kernel anyway.
> > > > 
> > > > So the need for swiotlb now changes from 32-bit DMA to any DMA
> > > > (non-coherent but we can't tell upfront when booting, devices may be
> > > > initialised pretty late).
> > 
> > Not only low-end phones, but there are other form-factors that can fall
> > into this category and are also memory constrained (e.g. wearable
> > devices), so the memory headroom impact from enabling SWIOTLB might be
> > non-negligible for all of these devices. I also think it's feasible for
> > those devices to use recent kernels.
> 
> Another option I had in mind is to disable this bouncing if there's no
> swiotlb buffer, so kmalloc() will return ARCH_DMA_MINALIGN (or the
> typically lower cache_line_size()) aligned objects. That's at least
> until we find a lighter way to do bouncing. Those devices would work as
> before.

The SWIOTLB buffer will not be allocated on devices whose RAM is small
and sits entirely below 4 GB. Those devices would still benefit greatly
from kmalloc() using smaller object sizes, so it would be unfortunate to
make that depend on the existence of the SWIOTLB buffer.
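
(For reference, and IIRC so the exact form may differ between kernel
versions: arm64's mem_init() only reserves the default SWIOTLB buffer
when some RAM sits above the DMA limit, roughly

	void __init mem_init(void)
	{
		/* only bounce-buffer if some RAM is not DMA-addressable */
		swiotlb_init(max_pfn > PFN_DOWN(arm64_dma_phys_limit),
			     SWIOTLB_VERBOSE);
		...
	}

so a device whose memory is entirely below 4 GB never gets the buffer in
the first place.)
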
> > > Yes.  The other option would be to use the dma coherent pool for the
> > > bouncing, which must be present on non-coherent systems anyway.  But
> > > it would require us to write a new set of bounce buffering routines.
> > 
> > I think in addition to having to write new bounce buffering routines,
> > this approach still suffers the same problem as SWIOTLB, which is that
> > the memory for SWIOTLB and/or the dma coherent pool is not reclaimable,
> > even when it is not used.
> 
> The dma coherent pool at least has the advantage that its size can be
> increased at run-time and we can start with a small one. Not decreased
> though, but if really needed I guess it can be added.
> 
> We'd also skip some cache maintenance here since the coherent pool is
> mapped as non-cacheable already. But to Christoph's point, it does
> require some reworking of the current bouncing code.

Right, I do think it's a good thing that the dma coherent pool starts
small and can grow. I don't think it would be too difficult to add logic
to free memory back. Perhaps a shrinker would be sufficient to release
memory when the system is under memory pressure, instead of relying on
some threshold?
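
Something like this rough sketch is what I have in mind (the
dma_bounce_pool_* names are made up for a hypothetical bounce pool; only
the shrinker API itself is real):

	/* number of currently unused pages held by the hypothetical pool */
	static atomic_long_t dma_bounce_pool_free_pages;

	static unsigned long dma_bounce_pool_count(struct shrinker *s,
						   struct shrink_control *sc)
	{
		return atomic_long_read(&dma_bounce_pool_free_pages);
	}

	static unsigned long dma_bounce_pool_scan(struct shrinker *s,
						  struct shrink_control *sc)
	{
		/* return up to sc->nr_to_scan unused pages to the page allocator */
		return dma_bounce_pool_release(sc->nr_to_scan);
	}

	static struct shrinker dma_bounce_pool_shrinker = {
		.count_objects	= dma_bounce_pool_count,
		.scan_objects	= dma_bounce_pool_scan,
		.seeks		= DEFAULT_SEEKS,
	};

	/* registered once when the pool is first populated */
	register_shrinker(&dma_bounce_pool_shrinker, "dma-bounce-pool");

That way the pool only gives memory back when reclaim actually asks for
it, rather than when some watermark is crossed.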

> 
> I've seen the expression below in a couple of places in the kernel,
> though IIUC in_atomic() doesn't always detect atomic contexts:
> 
> 	gfpflags = (in_atomic() || irqs_disabled()) ? GFP_ATOMIC : GFP_KERNEL;
> 
I'm not too sure about this; I was going more by how the mapping
callbacks in iommu/dma-iommu.c use the atomic variants of iommu_map().
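
That is, since the streaming map callbacks can be entered from contexts
that cannot sleep, and we can't reliably detect that, dma-iommu just
uses the atomic variant unconditionally rather than checking
in_atomic(); simplified from __iommu_dma_map() (from memory, so details
may be off):

	if (iommu_map_atomic(domain, iova, phys - iova_off, size, prot)) {
		iommu_dma_free_iova(cookie, iova, size, NULL);
		return DMA_MAPPING_ERROR;
	}
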
> > But what about having a pool that has a small amount of memory and is
> > composed of several objects that can be used for small DMA transfers?
> > If the amount of memory in the pool starts falling below a certain
> > threshold, there can be a worker thread--so that we don't have to use
> > GFP_ATOMIC--that can add more memory to the pool?
> 
> If the rate of allocation is high, it may end up calling a slab
> allocator directly with GFP_ATOMIC.
> 
> The main downside of any memory pool is identifying the original pool in
> dma_unmap_*(). We have a simple is_swiotlb_buffer() check looking just
> at the bounce buffer boundaries. For the coherent pool we have the more
> complex dma_free_from_pool().
> 
> With a kmem_cache-based allocator (whether it's behind a mempool or
> not), we'd need something like virt_to_cache() and checking whether it
> is from our DMA cache. I'm not a big fan of digging into the slab
> internals for this. An alternative could be some xarray to remember the
> bounced dma_addr.

Right. I had actually thought of using something like what is in
kernel/dma/pool.c for the dma coherent pool, where the pool is backed by
the page allocator and the objects have a fixed size (on arm64, for
example, it would be ALIGN(192, ARCH_DMA_MINALIGN) == 256, though it
would be good to have a more generic way of calculating this). Then
determining whether an object resides in the pool boils down to checking
the pool's backing pages, which is what the dma coherent pool does.
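
The unmap side would then only need something like this (a sketch;
gen_pool_has_addr() is the helper the coherent pool's free path already
uses, the other names are made up):

	/* hypothetical bounce pool backed by a gen_pool, as the dma
	 * coherent pools are */
	static struct gen_pool *dma_bounce_pool;

	static bool dma_bounce_addr_in_pool(void *vaddr, size_t size)
	{
		/* same kind of check dma_free_from_pool() does */
		return gen_pool_has_addr(dma_bounce_pool,
					 (unsigned long)vaddr, size);
	}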

> Anyway, I propose that we try the swiotlb first and look at optimising
> it from there, initially using the dma coherent pool.

Aside from the freeing logic, which can be added if needed as you
pointed out, and Christoph's point about reworking the bouncing code,
the dma coherent pool doesn't sound like a bad idea.

--Isaac


