[PATCH 1/2] media: rkvdec: reduce excessive stack usage in assemble_hw_pps()

Arnd Bergmann arnd at arndb.de
Mon Feb 2 07:59:05 PST 2026


On Mon, Feb 2, 2026, at 16:12, Nicolas Dufresne wrote:
> On Monday, February 2, 2026 at 15:09 +0100, Arnd Bergmann wrote:
>> On Mon, Feb 2, 2026, at 14:42, Nicolas Dufresne wrote:

>> Right, this randconfig build likely got closer to the warning
>> limit because of the inherent overhead in KASAN, but the problem
>> with the unaligned bitfields was something that I could later
>> reproduce without KASAN, on ARMv5 and MIPS32r2.
>> 
>> This is something we should fix in clang.
>
> All fair comments. I plan to take this into fixes (no changes needed), hopefully
> for rc-2.
>
> Performance-wise, this code replaces read/mask/write into hardware
> registers, which was significantly slower for this number of registers (~200
> 32-bit integers) and this type of IP (it's not SRAM). This runs once per frame.
> In practice, if we hand code the read/mask/write, the performance should
> eventually converge to using bitfield and letting the compiler do this masking,
> I was being optimistic on how the compiler would behave. If performance of that
> is truly a problem, we can always just prepare the ram register ahead of the
> operation queue (instead of doing it in the executor).

I think there are multiple things going on here, some of which are
more relevant than others:

 - The problem I'm addressing with my patch is purely a clang issue
   for CPU architectures with high register pressure when assembling
   the structure in memory. As a first-order approximation, you can
   see the output being around 12,000 lines with clang, but only
   600 with gcc in the godbolt.org output. The gcc version isn't that
   great either, but it is orders of magnitude fewer instructions.

 - MMIO reads are clearly a performance killer, so assembling the
   structure in memory and using memcpy_toio() to access the
   registers as you appear to be doing is the right idea.

 - using bitfields for hardware structures is non-portable. In
   particular, the order of the fields within a word depends on
   byte order (CONFIG_CPU_BIG_ENDIAN), and the alignment depends
   on the architecture, e.g. 'struct { u32 a:16; u32 b:32; u32 c:16; };'
   has the second member cross a u32 boundary, which leads to
   padding between a and b, as well as after c, on some architectures
   but not others. I would always recommend splitting up bitfields
   on word boundaries and adding explicit padding where necessary.

 - Since most of the fields are exactly 6 bits offset from a word
   boundary, you can try assembling all the *_field_order_cnt*
   fields first in an array that has all the bits in the correct
   order, and then shift the entire array by six bits.

      Arnd
