SError Interrupt on CPU0, code 0xbf000000 makes kernel panic

Thu Mar 24 07:50:44 PDT 2022

On Thu, 2022-03-24 at 14:17 +0000, Marc Zyngier wrote:
> On Thu, 24 Mar 2022 14:01:53 +0000,
> Joakim Tjernlund <Joakim.Tjernlund at infinera.com> wrote:
> > 
> > On Thu, 2022-03-24 at 13:16 +0000, Robin Murphy wrote:
> > > On 2022-03-24 12:10, Joakim Tjernlund wrote:
> > > > We have a custom SoC, CPU A53, that when an app accesses nonexistent address space reports:
> > > > # > devmem 0x20000000 w 0x1000 #this will open /dev/mem and write
> > > >   
> > > > [   37.570886] SError Interrupt on CPU0, code 0xbf000000 -- SError
> > > > [   37.571974] CPU: 0 PID: 72 Comm: devmem Not tainted 5.15.26-g18447c6fff6f-dirty #26
> > > > [   37.573150] Hardware name: infinera,xr (DT)
> > > > [   37.573599] pstate: 60000010 (nZCv q A32 LE aif -DIT -SSBS)
> > > > [   37.574705] pc : 000000000098775c
> > > > [   37.575063] lr : 0000000000986918
> > > > [   37.575392] sp : 00000000ffd140a8
> > > > [   37.575725] x12: 0000000000a36c10
> > > > [   37.576443] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000020
> > > > [   37.577872] x8 : 00000000ffd141c0 x7 : 00000000ffd14104 x6 : 0000000000986c9c
> > > > [   37.579278] x5 : 000000000000001f x4 : 0000000000000004 x3 : 0000000000a37020
> > > > [   37.580635] x2 : 0000000000000003 x1 : 0000000000001000 x0 : 0000000000000000
> > > > [   37.582164] Kernel panic - not syncing: Asynchronous SError Interrupt
> > > > [   37.582685] Kernel Offset: disabled
> > > > [   37.582932] CPU features: 0x00001001,20000842
> > > > [   37.583509] Memory Limit: none
> > > > [   37.630058] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
> > > > 
> > > > and the kernel panics. This is a surprise as I expected the app to just be killed by a SIGBUS.
> > > > Is this what to expect?
> > > > I see that the kernel looks for the RAS extension but we don't have that.
> > > > 
> > > > Can anything be done not to panic the kernel for such accesses?
> > > 
> > > No. The error comes back to the CPU in an unattributable manner, so all 
> > > it knows is that *something*, at some point in the past, went 
> > > catastrophically wrong. Saying "this is fine..." and carrying on 
> > > regardless isn't really viable. IIRC the RAS extension places 
> > > constraints on the delivery of async SError such that it's slightly more 
> > > possible to do something with, but without that all bets are off.
> > 
> > And this is because we don't have RAS? If we did have RAS,
> > would/could the kernel sort out the error so that the app gets a
> > SIGBUS or similar?
> 
> With RAS, the error would be containable, and attributed to the
> userspace task by the kernel on the next exception. Without RAS, panic
> is the only option, as we have no idea what the damage is. The machine
> is on fire, for all we know.

Thanks, now I know.

> 
> > 
> > > 
> > > > Can one build some sort of blacklist of address ranges which the MMU will block?
> > > 
> > > Sure, just configure the kernel with CONFIG_DEVMEM=n and it should never 
> > > access anything invalid.
> > > I'm not even entirely joking there - even for address ranges that the 
> > > kernel *does* know about, you can still SError or deadlock by poking at 
> > > something that's currently clock-gated or powered off, or lose coherency 
> > > and cause corruption by accessing memory with the wrong attributes; at 
> > > worst writing the wrong thing to the wrong place may even physically 
> > > damage the hardware.
> > > 
> > I know /dev/mem is bad and it was just an example, but such SW errors can
> > happen elsewhere too; we got one from a badly configured UIO device
> > as well. HW errors we just have to live with, but I hoped we could
> > handle some SW errors better.
> 
> I think you have the wrong end of the stick here. This *is* a HW
> error, and the HW tells you in no uncertain terms that something is
> really bad.

Yes, SW induced HW error is a better description.

> 
> If the device is supposed to be assignable to userspace, it either
> must be designed not to respond with a SError no matter what userspace
> is throwing at it (because let's face it, userspace will eventually do
> something really bad), or the whole system must be designed in a way
> that such error can be contained and attributed to the offending
> party.
> 
> Just giving userspace any odd device and hoping that it will all be
> fine is unfortunately wishful thinking.

Sure, I just want to limit the damage where I can. A pointer access to
nonexistent address space is not really harmful, and I want the app to take
the hit for it. At least then it is easier to log and troubleshoot.
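
For anyone reading the log above, the "code 0xbf000000" value is the exception
syndrome (ESR_ELx), and it can be split into its fields by hand. Below is a
minimal sketch in C, assuming the ESR_ELx layout from the Arm architecture
manual (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, with ISS bit 24
being the IDS flag for SError). The macro and function names are made up for
this illustration and are not the kernel's own helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* ESR_ELx field layout (names are local to this sketch, not kernel macros). */
#define ESR_EC_SHIFT   26
#define ESR_EC_MASK    0x3fu        /* exception class, bits 31:26 */
#define ESR_IL_BIT     (1u << 25)   /* instruction length bit */
#define ESR_ISS_MASK   0x01ffffffu  /* instruction specific syndrome, bits 24:0 */
#define ESR_EC_SERROR  0x2fu        /* exception class for an SError interrupt */
#define ESR_ISS_IDS    (1u << 24)   /* SError: syndrome is implementation defined */

/* Extract the exception class. */
static inline uint32_t esr_ec(uint32_t esr)
{
    return (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
}

/* Extract the IL bit. */
static inline bool esr_il(uint32_t esr)
{
    return (esr & ESR_IL_BIT) != 0;
}

/* Extract the instruction specific syndrome. */
static inline uint32_t esr_iss(uint32_t esr)
{
    return esr & ESR_ISS_MASK;
}

/* True if the syndrome reports an SError interrupt. */
static inline bool esr_is_serror(uint32_t esr)
{
    return esr_ec(esr) == ESR_EC_SERROR;
}

/*
 * True if this is an SError whose remaining ISS bits are implementation
 * defined, i.e. there is no architected description of what went wrong.
 */
static inline bool esr_serror_is_impdef(uint32_t esr)
{
    return esr_is_serror(esr) && (esr & ESR_ISS_IDS) != 0;
}
```

Feeding 0xbf000000 through these helpers gives EC = 0x2f (SError), IL = 1 and
ISS = 0x1000000, so only the IDS bit is set and the rest of the syndrome is
zero. That fits the explanation given earlier in the thread: the syndrome
carries nothing that would let the kernel attribute the error to the
offending task.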

 Jocke