SError Interrupt on CPU0, code 0xbf000000 makes kernel panic
Joakim Tjernlund
Joakim.Tjernlund at infinera.com
Thu Mar 24 07:01:53 PDT 2022
On Thu, 2022-03-24 at 13:16 +0000, Robin Murphy wrote:
> On 2022-03-24 12:10, Joakim Tjernlund wrote:
> > We have a custom SoC (A53 CPU) that reports the following when an app accesses nonexistent address space:
> > # > devmem 0x20000000 w 0x1000 #this will open /dev/mem and write
> >
> > [ 37.570886] SError Interrupt on CPU0, code 0xbf000000 -- SError
> > [ 37.571974] CPU: 0 PID: 72 Comm: devmem Not tainted 5.15.26-g18447c6fff6f-dirty #26
> > [ 37.573150] Hardware name: infinera,xr (DT)
> > [ 37.573599] pstate: 60000010 (nZCv q A32 LE aif -DIT -SSBS)
> > [ 37.574705] pc : 000000000098775c
> > [ 37.575063] lr : 0000000000986918
> > [ 37.575392] sp : 00000000ffd140a8
> > [ 37.575725] x12: 0000000000a36c10
> > [ 37.576443] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000020
> > [ 37.577872] x8 : 00000000ffd141c0 x7 : 00000000ffd14104 x6 : 0000000000986c9c
> > [ 37.579278] x5 : 000000000000001f x4 : 0000000000000004 x3 : 0000000000a37020
> > [ 37.580635] x2 : 0000000000000003 x1 : 0000000000001000 x0 : 0000000000000000
> > [ 37.582164] Kernel panic - not syncing: Asynchronous SError Interrupt
> > [ 37.582685] Kernel Offset: disabled
> > [ 37.582932] CPU features: 0x00001001,20000842
> > [ 37.583509] Memory Limit: none
> > [ 37.630058] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
> >
> > and the kernel panics. This is a surprise, as I expected the app to just be killed by a SIGBUS.
> > Is this what to expect?
> > I see that the kernel looks for the RAS extension, but we don't have that.
> >
> > Can anything be done not to panic the kernel for such accesses?
>
> No. The error comes back to the CPU in an unattributable manner, so all
> it knows is that *something*, at some point in the past, went
> catastrophically wrong. Saying "this is fine..." and carrying on
> regardless isn't really viable. IIRC the RAS extension places
> constraints on the delivery of async SError such that it's slightly more
> possible to do something with, but without that all bets are off.
And this is because we don't have RAS? If we did have RAS, would/could the kernel
sort out the error so that the app would get a SIGBUS or similar?
>
> > Can one build some sort of blacklist of address ranges which the MMU will block?
>
> Sure, just configure the kernel with CONFIG_DEVMEM=n and it should never
> access anything invalid.
> I'm not even entirely joking there - even for address ranges that the
> kernel *does* know about, you can still SError or deadlock by poking at
> something that's currently clock-gated or powered off, or lose coherency
> and cause corruption by accessing memory with the wrong attributes; at
> worst writing the wrong thing to the wrong place may even physically
> damage the hardware.
>
I know /dev/mem is bad, and it was only an example, but such SW errors can happen
elsewhere too; we got one from a badly configured UIO device as well.
HW errors we just have to live with, but I hoped we could handle some SW errors
better.
Jocke
More information about the linux-arm-kernel mailing list