4.4 BCM5301X ARM regression "External imprecise Data abort"

Ray Jui ray.jui at broadcom.com
Fri Apr 8 17:00:40 PDT 2016



On 4/8/2016 3:53 PM, Rafał Miłecki wrote:
> On 9 April 2016 at 00:41, Ray Jui <ray.jui at broadcom.com> wrote:
>> On 4/8/2016 3:11 PM, Rafał Miłecki wrote:
>>>
>>> On 9 April 2016 at 00:08, Ray Jui <ray.jui at broadcom.com> wrote:
>>>>
>>>> On 4/8/2016 3:05 PM, Rafał Miłecki wrote:
>>>>>
>>>>>
>>>>> On 9 April 2016 at 00:02, Ray Jui <ray.jui at broadcom.com> wrote:
>>>>>>
>>>>>>
>>>>>> On 4/8/2016 1:43 AM, Lucas Stach wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Friday, 08.04.2016 at 08:45 +0200, Rafał Miłecki wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 4 April 2016 at 23:23, Hauke Mehrtens <hauke at hauke-m.de> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 04/04/2016 11:08 PM, Scott Branden wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 16-04-03 11:13 PM, Rafał Miłecki wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I got regression reports from Netgear R8000 (BCM4709A0) users and
>>>>>>>>>>> did
>>>>>>>>>>> some testing & regression tracking with Aditya.
>>>>>>>>>>>
>>>>>>>>>>> It happens that Linux 4.4 doesn't boot due to the following
>>>>>>>>>>> commits:
>>>>>>>>>>> bbeb920 ("ARM: 8422/1: enable imprecise aborts during early kernel startup")
>>>>>>>>>>> 9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
>>>>>>>>>>> 937b123 ("ARM: BCM5301X: remove workaround imprecise abort fault handler")
>>>>>>>>>>>
>>>>>>>>>>> In kernel 4.3 we got that abort workaround which was resulting in:
>>>>>>>>>>> [    5.007128] Freeing unused kernel memory: 212K (c0435000 - c046a000)
>>>>>>>>>>> [    5.694632] init: Console is alive
>>>>>>>>>>> [    5.698169] init: - watchdog -
>>>>>>>>>>> [    5.701470] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
>>>>>>>>>>> As you can see, this abort was happening soon after freeing unused
>>>>>>>>>>> memory and ignoring it *once* did the trick. It was never
>>>>>>>>>>> appearing
>>>>>>>>>>> again.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I assume it only can throw one of these and if it is deactivated it
>>>>>>>>> will
>>>>>>>>> ignore the next one or overwrite it. So it could be that more than
>>>>>>>>> one
>>>>>>>>> is thrown here.
>>>>>>>>>
>>>>>>>>>>> With 4.4 similar (or the same?) abort happens earlier (during PCI
>>>>>>>>>>> host
>>>>>>>>>>> driver init) and doesn't get ignored:
>>>>>>>>>>> [    2.478461] pci 0000:00:00.0: PCI bridge to [bus 01]
>>>>>>>>>>> [    2.483451] pci 0000:00:00.0:   bridge window [mem 0x08000000-0x085fffff]
>>>>>>>>>>> [    2.599449] pcie_iproc_bcma bcma0:8: PCI host bridge to bus 0001:00
>>>>>>>>>>> [    2.605744] pci_bus 0001:00: root bus resource [mem 0x40000000-0x47ffffff]
>>>>>>>>>>> [    2.612657] pcie_iproc_bcma bcma0:8: link: UP
>>>>>>>>>>> [    2.617241] PCI: bus0: Fast back to back transfers disabled
>>>>>>>>>>> [    2.622845] pci 0001:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>>>>>>>>>>> [    2.631297] PCI: bus1: Fast back to back transfers disabled
>>>>>>>>>>> [    2.636887] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>>>>>>>>>>> [    2.645035] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
>>>>>>>>>>> (see 4.4.txt for the backtrace)
>>>>>>>>>>>
>>>>>>>>>>> At first I was hoping that we simply need to re-add the removed
>>>>>>>>>>> workaround. I tried it but it appeared that one abort is
>>>>>>>>>>> immediately
>>>>>>>>>>> followed by another:
>>>>>>>>>>> [    2.936895] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>>>>>>>>>>> [    2.945053] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
>>>>>>>>>>> [    2.951966] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
>>>>>>>>>>>
>>>>>>>>>>> So it seems that commits bbeb920 and 9254970 broke something in
>>>>>>>>>>> PCI
>>>>>>>>>>> host initialization (or maybe just exposed another bug?). Instead
>>>>>>>>>>> of
>>>>>>>>>>> getting an abort once and late we are getting now many of them and
>>>>>>>>>>> a
>>>>>>>>>>> bit earlier.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> These commits made the kernel "listen" to such errors earlier, so
>>>>>>>>> that
>>>>>>>>> they will be shown at the time they occur and not sometime later.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> So AFAIU with kernel 4.3:
>>>>>>>> 1) Aborts were masked (silent) until "Freeing unused kernel memory"
>>>>>>>> 2) There was one (silent) abort caused by a bootloader
>>>>>>>> 3) There were likely multiple aborts (silent) during early PCI init
>>>>>>>> 4) After unmasking we got only a single abort reported and we were
>>>>>>>> ignoring it
>>>>>>>>
>>>>>>>> With kernel 4.4:
>>>>>>>> 1) All aborts are reported immediately
>>>>>>>> 2) Abort caused by a bootloader gets ignored by ARM code:
>>>>>>>> "Hit pending asynchronous external abort (FSR=0x00001c06) during first unmask"
>>>>>>>> thanks to 9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
>>>>>>>> 3) There are still multiple aborts during PCI init (reported immediately now)
>>>>>>>> 4) To work as before (in 4.3) we should ignore all aborts, not only the 1st one
>>>>>>>>
>>>>>>>> Of course the proposed solution is an ugly workaround; we should have
>>>>>>>> no aborts reported in the first place.
>>>>>>>>
>>>>>>> A master abort on the PCI bus during probe of the PCI config space
>>>>>>> (device enumeration) is expected. Most host bridges ignore those
>>>>>>> errors
>>>>>>> and just return 0 for the read transaction.
>>>>>>>
>>>>>>> Some bridges forward the error onto the AXI/AMBA bus and thus cause
>>>>>>> imprecise external aborts on the ARM core.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Yes, I suspect this is the case for these imprecise external aborts
>>>>>> triggered by the iProc PCIe.
>>>>>>
>>>>>>> If your host bridge doesn't
>>>>>>> have a way to disable error forwarding during PCI bus probe you need
>>>>>>> to
>>>>>>> install an abort handler. Most implementations based on the designware
>>>>>>> PCIe core do this already.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Is this as simple as registering an abort handler to the hook in the
>>>>>> iProc
>>>>>> PCIe driver, and based on the fsr (0x1406 in our case), simply ignore
>>>>>> the
>>>>>> abort by returning zero from the abort handler?
>>>>>
>>>>>
>>>>>
>>>>> This is what I did in OpenWrt an hour ago and it seems to be working:
>>>>>
>>>>>
>>>>> http://git.openwrt.org/?p=openwrt.git;a=commitdiff;h=f823c5da71f0dd859facc5ece575a48c28279d35
>>>>>
>>>>
>>>> It looks good to me except that I think you should register the hook in
>>>> "iproc_pcie_setup" so both the BCMA and platform based iProc PCIe drivers
>>>> can use it.
>>>
>>>
>>> Should I add some new field to struct iproc_pcie, like "bool
>>> hook_abort_handler"?
>>>
>>
>> You want to enable/disable them based on platforms? I don't see a need at
>> this point...
>
> I was assuming we don't want this handler hooked on Northstar+, where
> the issue doesn't occur. If you think it's not worth it, we can hook
> it on all platforms.
>

I see. In this case, we might need a device tree based configuration 
that allows us to enable/disable the abort handler for different 
platforms (for all iProc platform bus based clients, including Cygnus, 
NSP, NS2, etc.). It sounds like even among the iProc SoCs that 
do not use BCMA, some need this abort hook and some don't.
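
A minimal user-space sketch of the decision such a hook would make,
assuming a per-platform opt-in flag. Everything here is hypothetical and
for illustration only: struct pcie_abort_cfg, its ignore_abort field and
pcie_abort_decide() are invented names, not part of the iProc PCIe
driver, and the real kernel hook has a different signature. The only
things taken from this thread are the fsr value 0x1406 and the
convention that returning zero means the abort is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* fsr value reported for the imprecise external abort in this thread. */
#define FSR_IMPRECISE_EXT_ABORT 0x1406u

/*
 * Hypothetical per-platform configuration. In a real driver this flag
 * might be filled in from a device tree property; that property does
 * not exist today and is only assumed here.
 */
struct pcie_abort_cfg {
	bool ignore_abort;
};

/*
 * Decide what to do with an abort.
 * Returns 0 if the abort should be ignored, non-zero if it should be
 * left unhandled (the kernel would then report it as a fault).
 */
static int pcie_abort_decide(const struct pcie_abort_cfg *cfg,
			     unsigned int fsr)
{
	assert(cfg != NULL);

	/* Platforms that did not opt in never swallow aborts. */
	if (!cfg->ignore_abort)
		return 1;

	/* Only the specific abort seen during PCI enumeration is ignored. */
	if (fsr == FSR_IMPRECISE_EXT_ABORT)
		return 0;

	return 1;
}
```

With the flag set, an abort with fsr 0x1406 yields 0 (ignored); any
other fsr, or a platform with the flag clear, yields non-zero and the
fault stays unhandled.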

Do you have any bandwidth to work on that? If not, you can leave the 
hook always installed in the "iproc_pcie_setup" routine for now, and 
later on when I have time I'll work out something for it.

Thanks,

Ray


