[PATCH 1/2] media: rkvdec: reduce excessive stack usage in assemble_hw_pps()

Nicolas Dufresne nicolas.dufresne at collabora.com
Mon Feb 2 08:31:40 PST 2026


Hi,

On Monday, 2 February 2026 at 16:59 +0100, Arnd Bergmann wrote:
> On Mon, Feb 2, 2026, at 16:12, Nicolas Dufresne wrote:
> > On Monday, 2 February 2026 at 15:09 +0100, Arnd Bergmann wrote:
> > > On Mon, Feb 2, 2026, at 14:42, Nicolas Dufresne wrote:
> 
> > > Right, this randconfig build likely got closer to the warning
> > > limit because of the inherent overhead in KASAN, but the problem
> > > with the unaligned bitfields was something that I could later
> > > reproduce without KASAN, on ARMv5 and MIPS32r2.
> > > 
> > > This is something we should fix in clang.
> > 
> > All fair comments. I plan to take this into fixes (no changes needed), hopefully
> > for rc-2.
> > 
> > Performance wise, this code is to replace read/mask/write into hardware
> > registers which was significantly slower for this amount of registers (~200
> > 32bit integers) and this type of IP (its not sram). This is run once per frame.
> > In practice, if we hand code the read/mask/write, the performance should
> > eventually converge to using bitfield and letting the compiler do this masking,
> > I was being optimistic on how the compiler would behave. If performance of that
> > is truly a problem, we can always just prepare the ram register ahead of the
> > operation queue (instead of doing it in the executor).
> 
> I think there are multiple things going on here, some of which are
> more relevant than others:
> 
>  - The problem I'm addressing with my patch is purely a clang issue
>    for CPU architectures with high register pressure when assembling
> >    the structure in memory. As a first-order approximation, you can
> >    see the output being around 12,000 lines with clang, but only
> >    600 with gcc in the godbolt.org output. The gcc version isn't that
>    great either, but it is orders of magnitude fewer instructions.
> 
>  - MMIO reads are clearly a performance killer, so assembling the
>    structure in memory and using memcpy_toio() to access the
>    registers as you appear to be doing is the right idea.
> 
>  - using bitfields for hardware structures is non-portable. In
>    particular, the order of the fields within a word depends on
>    byteorder (CONFIG_CPU_BIG_ENDIAN), and the alignment depends
>    on the architecture, e.g. 'struct { u32 a:16; u32 b:32; u32 c:16; };'
>    has the second member cross a u32 boundary, which leads to
>    padding between a and b, as well as after c on some architectures
>    but not others. I would always recommend splitting up bitfields
>    on word boundaries and adding explicit padding where necessary.

Ok, got it. Clearly the register bitfields (which are a set of 32-bit bitfields)
are fine (apart from endianness, but this is deliberately ignored). These are the
ones I had in mind, and they are optimized with memcpy_toio().

For the SPS/PPS bitstream, which is shared memory with the IP, I tend to agree
this might not have been the ideal choice, though the author did verify
everything with pahole for the relevant architectures (in practice only two
ARM64 SoCs use this bitstream format). I'm happy to revisit this eventually. It
would also not hurt to share a common bitstream writer that works with both
endiannesses in V4L2 (or use one from the core if that already exists).

Nicolas

> 
>  - Since most of the fields are exactly 6 bits offset from a word
>    boundary, you can try assembling all the *_field_order_cnt*
>    fields in an array first that has all the bits in the correct
>    order, but then shift the entire array six bits.



> 
>       Arnd