4.4 BCM5301X ARM regression "External imprecise Data abort"

Rafał Miłecki zajec5 at gmail.com
Fri Apr 8 15:53:15 PDT 2016


On 9 April 2016 at 00:41, Ray Jui <ray.jui at broadcom.com> wrote:
> On 4/8/2016 3:11 PM, Rafał Miłecki wrote:
>>
>> On 9 April 2016 at 00:08, Ray Jui <ray.jui at broadcom.com> wrote:
>>>
>>> On 4/8/2016 3:05 PM, Rafał Miłecki wrote:
>>>>
>>>>
>>>> On 9 April 2016 at 00:02, Ray Jui <ray.jui at broadcom.com> wrote:
>>>>>
>>>>>
>>>>> On 4/8/2016 1:43 AM, Lucas Stach wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Friday, 08.04.2016 at 08:45 +0200, Rafał Miłecki wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 4 April 2016 at 23:23, Hauke Mehrtens <hauke at hauke-m.de> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 04/04/2016 11:08 PM, Scott Branden wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 16-04-03 11:13 PM, Rafał Miłecki wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I got regression reports from Netgear R8000 (BCM4709A0) users and did
>>>>>>>>>> some testing & regression tracking with Aditya.
>>>>>>>>>>
>>>>>>>>>> It turns out that Linux 4.4 doesn't boot due to the following commits:
>>>>>>>>>> bbeb920 ("ARM: 8422/1: enable imprecise aborts during early kernel startup")
>>>>>>>>>> 9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
>>>>>>>>>> 937b123 ("ARM: BCM5301X: remove workaround imprecise abort fault handler")
>>>>>>>>>>
>>>>>>>>>> In kernel 4.3 we had that abort workaround, which resulted in:
>>>>>>>>>> [    5.007128] Freeing unused kernel memory: 212K (c0435000 - c046a000)
>>>>>>>>>> [    5.694632] init: Console is alive
>>>>>>>>>> [    5.698169] init: - watchdog -
>>>>>>>>>> [    5.701470] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
>>>>>>>>>> As you can see, this abort happened soon after freeing unused
>>>>>>>>>> memory, and ignoring it *once* did the trick. It never appeared
>>>>>>>>>> again.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I assume it can only raise one of these at a time, and if it is
>>>>>>>> deactivated it will ignore the next one or overwrite it. So it could
>>>>>>>> be that more than one is raised here.
>>>>>>>>
>>>>>>>>>> With 4.4 a similar (or the same?) abort happens earlier (during PCI
>>>>>>>>>> host driver init) and doesn't get ignored:
>>>>>>>>>> [    2.478461] pci 0000:00:00.0: PCI bridge to [bus 01]
>>>>>>>>>> [    2.483451] pci 0000:00:00.0:   bridge window [mem 0x08000000-0x085fffff]
>>>>>>>>>> [    2.599449] pcie_iproc_bcma bcma0:8: PCI host bridge to bus 0001:00
>>>>>>>>>> [    2.605744] pci_bus 0001:00: root bus resource [mem 0x40000000-0x47ffffff]
>>>>>>>>>> [    2.612657] pcie_iproc_bcma bcma0:8: link: UP
>>>>>>>>>> [    2.617241] PCI: bus0: Fast back to back transfers disabled
>>>>>>>>>> [    2.622845] pci 0001:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>>>>>>>>>> [    2.631297] PCI: bus1: Fast back to back transfers disabled
>>>>>>>>>> [    2.636887] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>>>>>>>>>> [    2.645035] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
>>>>>>>>>> (see 4.4.txt for the backtrace)
>>>>>>>>>>
>>>>>>>>>> At first I hoped that we would simply need to re-add the removed
>>>>>>>>>> workaround. I tried it, but it turned out that one abort is
>>>>>>>>>> immediately followed by another:
>>>>>>>>>> [    2.936895] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>>>>>>>>>> [    2.945053] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
>>>>>>>>>> [    2.951966] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
>>>>>>>>>>
>>>>>>>>>> So it seems that commits bbeb920 and 9254970 broke something in PCI
>>>>>>>>>> host initialization (or maybe just exposed another bug?). Instead of
>>>>>>>>>> getting an abort once and late, we are now getting many of them,
>>>>>>>>>> and a bit earlier.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> These commits made the kernel "listen" to such errors earlier, so that
>>>>>>>> they are shown at the time they occur and not sometime later.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> So AFAIU with kernel 4.3:
>>>>>>> 1) Aborts were masked (silent) until "Freeing unused kernel memory"
>>>>>>> 2) There was one (silent) abort caused by a bootloader
>>>>>>> 3) There were likely multiple aborts (silent) during early PCI init
>>>>>>> 4) After unmasking we got only a single abort reported and we were
>>>>>>> ignoring it
>>>>>>>
>>>>>>> With kernel 4.4:
>>>>>>> 1) All aborts are reported immediately
>>>>>>> 2) Abort caused by a bootloader gets ignored by ARM code:
>>>>>>> "Hit pending asynchronous external abort (FSR=0x00001c06) during first unmask"
>>>>>>> thanks to 9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
>>>>>>> 3) There are still multiple aborts during PCI init (reported immediately now)
>>>>>>> 4) To work as before (in 4.3) we should ignore all aborts, not only the 1st one
>>>>>>> Of course, the proposed solution is an ugly workaround; we should have
>>>>>>> no aborts reported in the first place.
>>>>>>>
>>>>>> A master abort on the PCI bus during probe of the PCI config space
>>>>>> (device enumeration) is expected. Most host bridges ignore those errors
>>>>>> and just return 0 for the read transaction.
>>>>>>
>>>>>> Some bridges forward the error onto the AXI/AMBA bus and thus cause
>>>>>> imprecise external aborts on the ARM core.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Yes, I suspect this is the case for these imprecise external aborts
>>>>> triggered by the iProc PCIe.
>>>>>
>>>>>> If your host bridge doesn't
>>>>>> have a way to disable error forwarding during PCI bus probe you need to
>>>>>> install an abort handler. Most implementations based on the designware
>>>>>> PCIe core do this already.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Is this as simple as registering an abort handler to the hook in the
>>>>> iProc PCIe driver and, based on the fsr (0x1406 in our case), simply
>>>>> ignoring the abort by returning zero from the abort handler?
>>>>
>>>>
>>>>
>>>> This is what I did in OpenWrt an hour ago and it seems to be working:
>>>>
>>>>
>>>> http://git.openwrt.org/?p=openwrt.git;a=commitdiff;h=f823c5da71f0dd859facc5ece575a48c28279d35
>>>>
>>>
>>> It looks good to me except that I think you should register the hook in
>>> "iproc_pcie_setup" so both the BCMA and platform based iProc PCIe drivers
>>> can use it.
>>
>>
>> Should I add some new field to struct iproc_pcie, like "bool
>> hook_abort_handler"?
>>
>
> You want to enable/disable them based on platforms? I don't see a need at
> this point...

I was assuming we don't want this handler hooked on Northstar+, where
the issue doesn't occur. If you think it's not worth it, we can hook
it on all platforms.

-- 
Rafał