bcm2711_thermal: Kernel panic - not syncing: Asynchronous SError Interrupt

Juerg Haefliger juerg.haefliger at canonical.com
Thu Jul 28 02:06:53 PDT 2022


On Wed, 27 Jul 2022 14:51:24 -0700
Florian Fainelli <f.fainelli at gmail.com> wrote:

> On 7/27/22 01:05, Juerg Haefliger wrote:
> > On Wed, 10 Feb 2021 14:59:45 -0800
> > Florian Fainelli <f.fainelli at gmail.com> wrote:
> >   
> >> On 2/10/2021 8:55 AM, Nicolas Saenz Julienne wrote:  
> >>> Hi Robin,
> >>>
> >>> On Wed, 2021-02-10 at 16:25 +0000, Robin Murphy wrote:    
> >>>> On 2021-02-10 13:15, Nicolas Saenz Julienne wrote:    
> >>>>> [ Add Robin, Catalin and Florian in case they want to chime in ]
> >>>>>
> >>>>> Hi Juerg, thanks for the report!
> >>>>>
> >>>>> On Wed, 2021-02-10 at 11:48 +0100, Juerg Haefliger wrote:    
> >>>>>> Trying to dump the BCM2711 registers kills the kernel:
> >>>>>>
> >>>>>> # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/range
> >>>>>> 0-efc
> >>>>>> # cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/registers
> >>>>>>
> >>>>>> [   62.857661] SError Interrupt on CPU1, code 0xbf000002 -- SError    
> >>>>>
> >>>>> So ESR's IDS (bit 24) is set, which means it's an 'Implementation Defined
> >>>>> SError,' hence IIUC the rest of the error code is meaningless to anyone outside
> >>>>> of Broadcom/RPi.    
> >>>>
> >>>> It's imp-def from the architecture's PoV, but the implementation in this 
> >>>> case is Cortex-A72, where 0x000002 means an attributable, containable 
> >>>> Slave Error:
> >>>>
> >>>> https://developer.arm.com/documentation/100095/0003/system-control/aarch64-register-descriptions/exception-syndrome-register--el1-and-el3?lang=en
> >>>>
> >>>> In other words, the thing at the other end of an interconnect 
> >>>> transaction said "no" :)
> >>>>
> >>>> (The fact that Cortex-A72 gets too far ahead of itself to take it as a 
> >>>> synchronous external abort is a mild annoyance, but hey...)    
> >>>
> >>> Thanks for both your clarifications! Reading arm documentation is a skill on
> >>> its own.    
> >>
> >> Yes it is.
> >>  
> >>>     
> >>>>> The regmap is created through the following syscon device:
> >>>>>
> >>>>> 	avs_monitor: avs-monitor@7d5d2000 {
> >>>>> 		compatible = "brcm,bcm2711-avs-monitor",
> >>>>> 			     "syscon", "simple-mfd";
> >>>>> 		reg = <0x7d5d2000 0xf00>;
> >>>>>
> >>>>> 		thermal: thermal {
> >>>>> 			compatible = "brcm,bcm2711-thermal";
> >>>>> 			#thermal-sensor-cells = <0>;
> >>>>> 		};
> >>>>> 	};
> >>>>>
> >>>>> I've done some tests with devmem, and the whole <0x7d5d2000 0xf00> range is
> >>>>> full of addresses that trigger this same error. Also note that as per Florian's
> >>>>> comments[1]: "AVS_RO_REGISTERS_0: 0x7d5d2200 - 0x7d5d22e3." But from what I can
> >>>>> tell, at least 0x7d5d22b0 seems to be faulty too.
> >>>>>
> >>>>> Any ideas/comments? My guess is that those addresses are marked somehow as
> >>>>> secure, and only for VC4 to access (VC4 is RPi4's co-processor). Ultimately,
> >>>>> the solution is to narrow the register range exposed by avs-monitor to whatever
> >>>>> bcm2711-thermal needs (which is ATM a single 32bit register).    
> >>>>
> >>>> When a peripheral decodes a region of address space, nobody says it has 
> >>>> to accept accesses to *every* address in that space; registers may be 
> >>>> sparsely populated, and although some devices might be "nice" and make 
> >>>> unused areas behave as RAZ/WI, others may throw slave errors if you poke 
> >>>> at the wrong places. As you note, in a TrustZone-aware device some 
> >>>> registers may only exist in one or other of the Secure/Non-Secure 
> >>>> address spaces.
> >>>>
> >>>> Even when there is a defined register at a given address, it still 
> >>>> doesn't necessarily accept all possible types of access; it wouldn't be 
> >>>> particularly friendly, but a device *could* have, say, some registers 
> >>>> that support 32-bit accesses and others that only support 16-bit 
> >>>> accesses, and thus throw slave errors if you do the wrong thing in the 
> >>>> wrong place.
> >>>>
> >>>> It really all depends on the device itself.    
> >>>
> >>> All in all, assuming there is no special device quirk to apply, the feeling I'm
> >>> getting is to just let the error be. As you hint, firmware has no blame here,
> >>> and debugfs is a 'best effort, zero guarantees' interface after all.    
> >>
> >> We should probably fill in a regmap_access_table to deny reading registers
> >> for which there is no address decoding, and possibly another one to deny
> >> writing to the read-only registers.
> > 
> > 
> > Below is a patch that adds a read access table but it seems wrong to include
> > 'internal.h' and add the table in the thermal driver. Shouldn't this happen
> > in a higher layer, somehow between syscon and the thermal node?  
> 
> What is the purpose of doing this, though, that cannot already be done using
> devmem/devmem2 if the point is to explore the address space?

The goal is to prevent a kernel crash when doing
$ cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/registers
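
To illustrate the read access table idea discussed above, here is a minimal
userspace sketch, not the kernel patch itself. It models how a "yes ranges"
table would restrict a debugfs-style dump to registers known to decode. The
type and function names (reg_range, access_table, reg_readable, count_dumped)
are hypothetical stand-ins for the kernel's struct regmap_range and
struct regmap_access_table, and the 0x200 offset of the temperature register
is an assumption based on the AVS_RO_REGISTERS_0 range quoted earlier.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model of an inclusive register range. */
struct reg_range {
	uint32_t min;
	uint32_t max;
};

/* Hypothetical model of an access table holding only "yes" ranges. */
struct access_table {
	const struct reg_range *yes_ranges;
	size_t n_yes_ranges;
};

/* ASSUMPTION: offset of the single 32-bit register the thermal driver reads. */
#define AVS_RO_TEMP_STATUS 0x200u

static const struct reg_range avs_read_ranges[] = {
	{ AVS_RO_TEMP_STATUS, AVS_RO_TEMP_STATUS },
};

static const struct access_table avs_rd_table = {
	.yes_ranges = avs_read_ranges,
	.n_yes_ranges = sizeof(avs_read_ranges) / sizeof(avs_read_ranges[0]),
};

/* A register is readable only if it falls inside one of the yes ranges. */
static bool reg_readable(const struct access_table *t, uint32_t reg)
{
	for (size_t i = 0; i < t->n_yes_ranges; i++)
		if (reg >= t->yes_ranges[i].min && reg <= t->yes_ranges[i].max)
			return true;
	return false;
}

/* Count the registers a dump with a 4-byte stride would actually touch. */
static size_t count_dumped(const struct access_table *t, uint32_t max_reg)
{
	size_t n = 0;

	for (uint32_t reg = 0; reg <= max_reg; reg += 4)
		if (reg_readable(t, reg))
			n++;
	return n;
}
```

With such a table, a dump over the 0-efc range reported by the 'range' file
would touch only the one register in the table and skip everything else,
including offsets like 0x2b0 that were reported as faulting.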

...Juerg