[PATCH 07/10] crypto: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN

Thu Apr 21 04:06:58 PDT 2022

On Thu, Apr 21, 2022 at 12:20:22AM -0700, Christoph Hellwig wrote:
> Btw, there is another option:  Most real systems already require having
> swiotlb to bounce buffer in some cases.  We could simply force bounce
> buffering in the dma mapping code for too small or not properly aligned
> transfers and just decrease the dma alignment.

We can force bouncing when the size is small, but checking the alignment
is trickier. Normally the beginning of the buffer is aligned but the end
is at some sizeof() distance. We need to know whether the end is in a
kmalloc-128 cache, and that requires reaching into the slab internals.
That's doable and not expensive, but it needs to be done for every small
buffer getting to the DMA API, something like (for mm/slub.c):

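	/* x: address of the small buffer reaching the DMA API */
	folio = virt_to_folio(x);
	slab = folio_slab(folio);
	/* the owning cache aligns objects below the DMA minimum */
	if (slab->slab_cache->align < ARCH_DMA_MINALIGN)
		... bounce ...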

(and a bit different for mm/slab.c)

If we scrap ARCH_DMA_MINALIGN altogether from arm64, we can check the
alignment against cache_line_size(), though I'd rather keep it for code
that wants to avoid bouncing and goes for this compile-time alignment.
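
For illustration only, the decision described above can be modelled in
plain userspace C. This is a sketch under stated assumptions, not kernel
code: model_needs_bounce() and its parameters are hypothetical names, the
cache alignment is passed in directly instead of being read from the slab
internals, and the threshold is a parameter so it can stand for either
ARCH_DMA_MINALIGN or a runtime cache_line_size() value. The model only
examines small transfers; it does not establish that larger ones are
always suitably aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace model of the bounce decision.
 *
 * cache_align: alignment of the slab cache owning the buffer
 *              (stands in for slab->slab_cache->align)
 * size:        length of the transfer in bytes
 * dma_align:   required DMA alignment (an assumption of this sketch:
 *              128 for ARCH_DMA_MINALIGN, or a runtime cache line size)
 */
static bool model_needs_bounce(size_t cache_align, size_t size,
			       size_t dma_align)
{
	assert(dma_align != 0);

	/* Only small transfers are examined by this model. */
	if (size >= dma_align)
		return false;

	/* Small buffer from a cache with weaker alignment: bounce it. */
	return cache_align < dma_align;
}
```

With these assumptions, model_needs_bounce(8, 24, 128) asks for a bounce,
while the same 24-byte buffer from a 128-byte-aligned cache does not.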

I think we are down to four options (1 and 2 can be combined):

1. ARCH_DMA_MINALIGN == 128, dynamic arch_kmalloc_minalign() to reduce
   kmalloc() alignment to 64 on most arm64 SoCs - this series.

2. ARCH_DMA_MINALIGN == 128, ARCH_KMALLOC_MINALIGN == 128, add explicit
   __GFP_PACKED for small allocations. It can be combined with (1) so
   that allocations without __GFP_PACKED can still get 64-byte
   alignment.

3. ARCH_DMA_MINALIGN == 128, ARCH_KMALLOC_MINALIGN == 8, swiotlb bounce.

4. undef ARCH_DMA_MINALIGN, ARCH_KMALLOC_MINALIGN == 8, swiotlb bounce.
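
To give a rough feel for what the minimum kmalloc() alignment in the
options above means for small allocations, here is a simplified userspace
model. It is a sketch, not the allocator's logic: model_kmalloc_footprint()
is a hypothetical helper that assumes power-of-two size classes only,
whereas the real kmalloc caches also include non-power-of-two sizes such
as 96 and 192 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: smallest power-of-two size class that is at least
 * `minalign` bytes and can hold `size` bytes. Non-power-of-two caches
 * are deliberately ignored.
 */
static size_t model_kmalloc_footprint(size_t size, size_t minalign)
{
	size_t cls = minalign;

	/* minalign must be a non-zero power of two. */
	assert(minalign != 0 && (minalign & (minalign - 1)) == 0);
	/* Keep the doubling loop below from overflowing. */
	assert(size <= (SIZE_MAX >> 1));

	while (cls < size)
		cls <<= 1;

	return cls;
}
```

Under this model a 24-byte allocation occupies 128 bytes with a minimum
alignment of 128, 64 bytes with a minimum of 64, and 32 bytes with a
minimum of 8.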

(3) and (4) don't require histogram analysis. Between them, I have a
preference for (3) as it gives drivers a chance to avoid the bounce.

If (2) is feasible, we don't need to bother with any bouncing or
structure alignments; it's an opt-in by the driver/subsystem. However,
it may be tedious to analyse the hot spots. While there are a few
obvious places (kstrdup), I don't have access to a multitude of devices
that may exercise the drivers and subsystems.

With (3) the risk is someone complaining about performance or even
running out of swiotlb space on some SoCs (I guess the fall-back can be
another kmalloc() with an appropriate size).
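
The fall-back idea can be sketched in userspace C as follows. Everything
here is an assumption made for illustration: the fixed two-slot pool
stands in for swiotlb space, aligned_alloc() stands in for "another
kmalloc() with an appropriate size", and all model_* names are
hypothetical. It says nothing about how the kernel would implement it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_DMA_MINALIGN 128u	/* assumed, as in ARCH_DMA_MINALIGN == 128 */
#define MODEL_POOL_SLOTS   2u	/* tiny pool so exhaustion is easy to hit */

/* Stand-in for the bounce buffer space. */
struct model_pool {
	_Alignas(128) unsigned char slots[MODEL_POOL_SLOTS][MODEL_DMA_MINALIGN];
	unsigned int used;
};

enum model_src { MODEL_FROM_POOL, MODEL_FROM_FALLBACK, MODEL_FAILED };

struct model_bounce {
	void *buf;		/* aligned copy of the data, or NULL */
	enum model_src src;	/* where the copy lives */
};

static bool model_is_dma_aligned(const void *p)
{
	return ((uintptr_t)p % MODEL_DMA_MINALIGN) == 0;
}

/* Copy a small buffer into the pool, or into a fresh aligned allocation. */
static struct model_bounce model_bounce_small(struct model_pool *pool,
					      const void *data, size_t len)
{
	struct model_bounce b = { NULL, MODEL_FAILED };

	if (len == 0 || len > MODEL_DMA_MINALIGN)
		return b;

	if (pool->used < MODEL_POOL_SLOTS) {
		b.buf = pool->slots[pool->used++];
		b.src = MODEL_FROM_POOL;
	} else {
		/* Pool exhausted: allocate a whole aligned granule. */
		b.buf = aligned_alloc(MODEL_DMA_MINALIGN, MODEL_DMA_MINALIGN);
		if (!b.buf)
			return b;
		b.src = MODEL_FROM_FALLBACK;
	}

	memcpy(b.buf, data, len);
	return b;
}
```

In this model the caller owns a buffer returned with MODEL_FROM_FALLBACK
and must free() it once the transfer is done; pool slots are never
recycled, which keeps the sketch short.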

I guess we can limit the choice to either (2) or (3). I have (2) already
(needs some more testing). I can attempt (3) and try to run it on some
real hardware to see the perf impact.

-- 
Catalin