SError Interrupt on CPU0, code 0xbf000000 makes kernel panic

Joakim Tjernlund Joakim.Tjernlund at infinera.com
Thu Mar 24 07:01:53 PDT 2022


On Thu, 2022-03-24 at 13:16 +0000, Robin Murphy wrote:
> On 2022-03-24 12:10, Joakim Tjernlund wrote:
> > We have a custom SoC with an A53 CPU that reports the following when an app accesses nonexistent address space:
> > # > devmem 0x20000000 w 0x1000 #this will open /dev/mem and write
> >   
> > [   37.570886] SError Interrupt on CPU0, code 0xbf000000 -- SError
> > [   37.571974] CPU: 0 PID: 72 Comm: devmem Not tainted 5.15.26-g18447c6fff6f-dirty #26
> > [   37.573150] Hardware name: infinera,xr (DT)
> > [   37.573599] pstate: 60000010 (nZCv q A32 LE aif -DIT -SSBS)
> > [   37.574705] pc : 000000000098775c
> > [   37.575063] lr : 0000000000986918
> > [   37.575392] sp : 00000000ffd140a8
> > [   37.575725] x12: 0000000000a36c10
> > [   37.576443] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000020
> > [   37.577872] x8 : 00000000ffd141c0 x7 : 00000000ffd14104 x6 : 0000000000986c9c
> > [   37.579278] x5 : 000000000000001f x4 : 0000000000000004 x3 : 0000000000a37020
> > [   37.580635] x2 : 0000000000000003 x1 : 0000000000001000 x0 : 0000000000000000
> > [   37.582164] Kernel panic - not syncing: Asynchronous SError Interrupt
> > [   37.582685] Kernel Offset: disabled
> > [   37.582932] CPU features: 0x00001001,20000842
> > [   37.583509] Memory Limit: none
> > [   37.630058] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
> > 
> > and the kernel panics. This is a surprise, as I expected the app to just be killed by a SIGBUS.
> > Is this what to expect?
> > I see that the kernel looks for the RAS extension, but we don't have that.
> > 
> > Can anything be done not to panic the kernel for such accesses?
> 
> No. The error comes back to the CPU in an unattributable manner, so all 
> it knows is that *something*, at some point in the past, went 
> catastrophically wrong. Saying "this is fine..." and carrying on 
> regardless isn't really viable. IIRC the RAS extension places 
> constraints on the delivery of async SError such that it's slightly more 
> possible to do something with, but without that all bets are off.

And this is because we don't have RAS? If we did have RAS, would/could the kernel
sort out the error so that the app gets a SIGBUS or similar?
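
For reference, the code 0xbf000000 from the log can be split into its syndrome
fields. This is a minimal sketch in Python, assuming the standard ESR_ELx layout
(EC in bits 31:26, IL in bit 25, ISS in bits 24:0, and for an SError the IDS
flag in ISS bit 24). The helper name decode_serror_esr is made up for
illustration and is not a kernel function.

```python
def decode_serror_esr(esr: int) -> dict:
    """Split a 32-bit ESR_ELx value into the fields relevant to an SError."""
    ec = (esr >> 26) & 0x3F      # exception class, bits 31:26
    il = (esr >> 25) & 0x1       # instruction length bit, bit 25
    iss = esr & 0x1FFFFFF        # instruction specific syndrome, bits 24:0
    ids = (iss >> 24) & 0x1      # 1 = implementation defined syndrome
    return {"ec": ec, "il": il, "ids": ids, "iss_low": iss & 0xFFFFFF}


fields = decode_serror_esr(0xBF000000)
print({name: hex(value) for name, value in fields.items()})
```

For this value EC is 0x2f (SError interrupt) and IDS is 1 with the remaining
ISS bits all zero, i.e. the syndrome is implementation defined and carries no
architected detail about what went wrong.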

> 
> > Can one build some sort of blacklisted address spaces which the MMU will block?
> 
> Sure, just configure the kernel with CONFIG_DEVMEM=n and it should never 
> access anything invalid.
> I'm not even entirely joking there - even for address ranges that the 
> kernel *does* know about, you can still SError or deadlock by poking at 
> something that's currently clock-gated or powered off, or lose coherency 
> and cause corruption by accessing memory with the wrong attributes; at 
> worst writing the wrong thing to the wrong place may even physically 
> damage the hardware.
> 
I know /dev/mem is bad, and it was just an example, but such SW errors can happen
elsewhere too; we got one from a badly configured UIO device as well.
HW errors we just have to live with, but I hoped we could handle some SW errors
better.
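
One userspace-side mitigation for this class of SW error is to check a physical
address against the ranges listed in /proc/iomem before touching it. The sketch
below is only an illustration: the function names and the sample listing are
made up, and it only catches addresses that no resource claims. As noted above,
a listed range can still fault if the device is clock-gated or powered off.

```python
import re

# Matches lines such as "  40080000-40ffffff : Kernel code"
_RANGE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.*)$")


def parse_iomem(text):
    """Return (start, end, name) tuples from /proc/iomem-style text."""
    ranges = []
    for line in text.splitlines():
        match = _RANGE.match(line)
        if match:
            start = int(match.group(1), 16)
            end = int(match.group(2), 16)
            ranges.append((start, end, match.group(3).strip()))
    return ranges


def is_claimed(addr, length, ranges):
    """True if [addr, addr + length) lies entirely inside one listed range."""
    last = addr + length - 1
    return any(start <= addr and last <= end for start, end, _ in ranges)


# Hypothetical listing, not taken from the board discussed in this thread.
SAMPLE = """\
40000000-7fffffff : System RAM
  40080000-40ffffff : Kernel code
80000000-80000fff : serial@80000000
"""

ranges = parse_iomem(SAMPLE)
print(is_claimed(0x20000000, 4, ranges))  # address from the devmem example
print(is_claimed(0x80000000, 4, ranges))
```

On a real system the text would come from reading /proc/iomem as root, since
unprivileged readers see the addresses zeroed out.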

 Jocke

