4.4 BCM5301X ARM regression "External imprecise Data abort"

Ray Jui ray.jui at broadcom.com
Fri Apr 8 15:41:11 PDT 2016



On 4/8/2016 3:11 PM, Rafał Miłecki wrote:
> On 9 April 2016 at 00:08, Ray Jui <ray.jui at broadcom.com> wrote:
>> On 4/8/2016 3:05 PM, Rafał Miłecki wrote:
>>>
>>> On 9 April 2016 at 00:02, Ray Jui <ray.jui at broadcom.com> wrote:
>>>>
>>>> On 4/8/2016 1:43 AM, Lucas Stach wrote:
>>>>>
>>>>>
>>>>> On Friday, 08.04.2016 at 08:45 +0200, Rafał Miłecki wrote:
>>>>>>
>>>>>>
>>>>>> On 4 April 2016 at 23:23, Hauke Mehrtens <hauke at hauke-m.de> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 04/04/2016 11:08 PM, Scott Branden wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 16-04-03 11:13 PM, Rafał Miłecki wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I got regression reports from Netgear R8000 (BCM4709A0) users and did
>>>>>>>>> some testing & regression tracking with Aditya.
>>>>>>>>>
>>>>>>>>> It turns out that Linux 4.4 doesn't boot due to the following commits:
>>>>>>>>> bbeb920 ("ARM: 8422/1: enable imprecise aborts during early kernel startup")
>>>>>>>>> 9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
>>>>>>>>> 937b123 ("ARM: BCM5301X: remove workaround imprecise abort fault handler")
>>>>>>>>>
>>>>>>>>> In kernel 4.3 we had that abort workaround, which resulted in:
>>>>>>>>> [    5.007128] Freeing unused kernel memory: 212K (c0435000 - c046a000)
>>>>>>>>> [    5.694632] init: Console is alive
>>>>>>>>> [    5.698169] init: - watchdog -
>>>>>>>>> [    5.701470] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
>>>>>>>>> As you can see, this abort happened soon after freeing unused
>>>>>>>>> memory, and ignoring it *once* did the trick. It never appeared
>>>>>>>>> again.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I assume it can only raise one of these at a time, and while it is
>>>>>>> deactivated it will ignore the next one or overwrite it. So more than
>>>>>>> one could be raised here.
>>>>>>>
>>>>>>>>> With 4.4 a similar (or the same?) abort happens earlier (during PCI
>>>>>>>>> host driver init) and doesn't get ignored:
>>>>>>>>> [    2.478461] pci 0000:00:00.0: PCI bridge to [bus 01]
>>>>>>>>> [    2.483451] pci 0000:00:00.0:   bridge window [mem 0x08000000-0x085fffff]
>>>>>>>>> [    2.599449] pcie_iproc_bcma bcma0:8: PCI host bridge to bus 0001:00
>>>>>>>>> [    2.605744] pci_bus 0001:00: root bus resource [mem 0x40000000-0x47ffffff]
>>>>>>>>> [    2.612657] pcie_iproc_bcma bcma0:8: link: UP
>>>>>>>>> [    2.617241] PCI: bus0: Fast back to back transfers disabled
>>>>>>>>> [    2.622845] pci 0001:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>>>>>>>>> [    2.631297] PCI: bus1: Fast back to back transfers disabled
>>>>>>>>> [    2.636887] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>>>>>>>>> [    2.645035] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
>>>>>>>>> (see 4.4.txt for the backtrace)
>>>>>>>>>
>>>>>>>>> At first I was hoping that we simply needed to re-add the removed
>>>>>>>>> workaround. I tried it, but it turned out that one abort is
>>>>>>>>> immediately followed by another:
>>>>>>>>> [    2.936895] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>>>>>>>>> [    2.945053] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
>>>>>>>>> [    2.951966] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
>>>>>>>>>
>>>>>>>>> So it seems that commits bbeb920 and 9254970 broke something in PCI
>>>>>>>>> host initialization (or maybe just exposed another bug?). Instead of
>>>>>>>>> getting an abort once and late, we now get many of them, and a bit
>>>>>>>>> earlier.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> These commits made the kernel "listen" for such errors earlier, so that
>>>>>>> they are shown at the time they occur and not sometime later.
>>>>>>
>>>>>>
>>>>>>
>>>>>> So AFAIU with kernel 4.3:
>>>>>> 1) Aborts were masked (silent) until "Freeing unused kernel memory"
>>>>>> 2) There was one (silent) abort caused by a bootloader
>>>>>> 3) There were likely multiple aborts (silent) during early PCI init
>>>>>> 4) After unmasking we got only a single abort reported and we were
>>>>>> ignoring it
>>>>>>
>>>>>> With kernel 4.4:
>>>>>> 1) All aborts are reported immediately
>>>>>> 2) Abort caused by a bootloader gets ignored by ARM code:
>>>>>> "Hit pending asynchronous external abort (FSR=0x00001c06) during first unmask"
>>>>>> thanks to 9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
>>>>>> 3) There are still multiple aborts during PCI init (reported immediately now)
>>>>>> 4) To work as before (in 4.3) we should ignore all aborts, not only the 1st one
>>>>>>
>>>>>> Of course the proposed solution is an ugly workaround; we should have no
>>>>>> aborts reported in the first place.
>>>>>>
>>>>> A master abort on the PCI bus during probe of the PCI config space
>>>>> (device enumeration) is expected. Most host bridges ignore those errors
>>>>> and just return 0 for the read transaction.
>>>>>
>>>>> Some bridges forward the error onto the AXI/AMBA bus and thus cause
>>>>> imprecise external aborts on the ARM core.
>>>>
>>>>
>>>>
>>>> Yes, I suspect this is the case for these imprecise external aborts
>>>> triggered by the iProc PCIe.
>>>>
>>>>> If your host bridge doesn't
>>>>> have a way to disable error forwarding during PCI bus probe you need to
>>>>> install an abort handler. Most implementations based on the designware
>>>>> PCIe core do this already.
>>>>
>>>>
>>>>
>>>> Is this as simple as registering an abort handler with the hook in the
>>>> iProc PCIe driver and, based on the fsr (0x1406 in our case), simply
>>>> ignoring the abort by returning zero from the abort handler?
>>>
>>>
>>> This is what I did in OpenWrt an hour ago and it seems to be working:
>>>
>>> http://git.openwrt.org/?p=openwrt.git;a=commitdiff;h=f823c5da71f0dd859facc5ece575a48c28279d35
>>>
>>
>> It looks good to me except that I think you should register the hook in
>> "iproc_pcie_setup" so both the BCMA and platform based iProc PCIe drivers
>> can use it.
>
> Should I add some new field to struct iproc_pcie, like "bool
> hook_abort_handler"?
>

Do you want to enable/disable the hook per platform? I don't see a need
for that at this point.

Thanks,

Ray
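
For reference, here is a rough userspace sketch of the "ignore the abort
based on the fsr" idea discussed above. It is not the OpenWrt patch and not
kernel code. It only decodes the fault status field from the FSR values seen
in the logs, assuming the ARMv7 short-descriptor layout (FS[3:0] in bits
[3:0], FS[4] in bit 10). The names fsr_fs and abort_handler_sketch and the
FSR_* macros are made up for this example, and the return convention
(0 = ignore the abort) follows what is described in the thread.

```c
#include <assert.h>   /* only needed for the usage checks shown below */
#include <stdint.h>

/* Assumed ARMv7 short-descriptor FSR layout for the fault status field. */
#define FSR_FS3_0 0x0000000fu   /* fault status bits [3:0] */
#define FSR_FS4   (1u << 10)    /* fault status bit [4], stored in bit 10 */

/* Fault status value for an asynchronous (imprecise) external abort. */
#define FS_ASYNC_EXTERNAL_ABORT 0x16u

/* Combine FS[3:0] and FS[4] into one 5-bit fault status value. */
unsigned int fsr_fs(uint32_t fsr)
{
    return (fsr & FSR_FS3_0) | ((fsr & FSR_FS4) >> 6);
}

/*
 * Hypothetical handler: returns 0 to say "ignore this abort" when the
 * fault status is an asynchronous external abort, and non-zero to leave
 * any other fault to the default handling.
 */
int abort_handler_sketch(unsigned long addr, uint32_t fsr)
{
    (void)addr; /* the reported address (0x0 in the logs) is not used */

    if (fsr_fs(fsr) == FS_ASYNC_EXTERNAL_ABORT)
        return 0;

    return 1;
}
```

With the values from the logs, both fsr_fs(0x1406) and fsr_fs(0x1c06) decode
to 0x16 (22), so abort_handler_sketch(0, 0x1406) returns 0. In the kernel a
handler of this shape would be registered through the ARM hook_fault_code()
interface; that call is left out here so the sketch stays buildable outside
the kernel tree.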


