4.4 BCM5301X ARM regression "External imprecise Data abort"

Rafał Miłecki zajec5 at gmail.com
Fri Apr 8 15:11:35 PDT 2016


On 9 April 2016 at 00:08, Ray Jui <ray.jui at broadcom.com> wrote:
> On 4/8/2016 3:05 PM, Rafał Miłecki wrote:
>>
>> On 9 April 2016 at 00:02, Ray Jui <ray.jui at broadcom.com> wrote:
>>>
>>> On 4/8/2016 1:43 AM, Lucas Stach wrote:
>>>>
>>>>
>>>> On Friday, 08.04.2016 at 08:45 +0200, Rafał Miłecki wrote:
>>>>>
>>>>>
>>>>> On 4 April 2016 at 23:23, Hauke Mehrtens <hauke at hauke-m.de> wrote:
>>>>>>
>>>>>>
>>>>>> On 04/04/2016 11:08 PM, Scott Branden wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 16-04-03 11:13 PM, Rafał Miłecki wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> I got regression reports from Netgear R8000 (BCM4709A0) users and did
>>>>>>>> some testing & regression tracking with Aditya.
>>>>>>>>
>>>>>>>> It turns out that Linux 4.4 doesn't boot due to the following commits:
>>>>>>>> bbeb920 ("ARM: 8422/1: enable imprecise aborts during early kernel startup")
>>>>>>>> 9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
>>>>>>>> 937b123 ("ARM: BCM5301X: remove workaround imprecise abort fault handler")
>>>>>>>>
>>>>>>>> Kernel 4.3 had that abort workaround, which resulted in:
>>>>>>>> [    5.007128] Freeing unused kernel memory: 212K (c0435000 - c046a000)
>>>>>>>> [    5.694632] init: Console is alive
>>>>>>>> [    5.698169] init: - watchdog -
>>>>>>>> [    5.701470] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
>>>>>>>> As you can see, this abort happened soon after freeing unused
>>>>>>>> memory, and ignoring it *once* did the trick. It never appeared
>>>>>>>> again.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I assume it can only throw one of these, and if it is deactivated it
>>>>>> will ignore the next one or overwrite it. So it could be that more
>>>>>> than one is thrown here.
>>>>>>
>>>>>>>> With 4.4 a similar (or the same?) abort happens earlier (during PCI
>>>>>>>> host driver init) and doesn't get ignored:
>>>>>>>> [    2.478461] pci 0000:00:00.0: PCI bridge to [bus 01]
>>>>>>>> [    2.483451] pci 0000:00:00.0:   bridge window [mem 0x08000000-0x085fffff]
>>>>>>>> [    2.599449] pcie_iproc_bcma bcma0:8: PCI host bridge to bus 0001:00
>>>>>>>> [    2.605744] pci_bus 0001:00: root bus resource [mem 0x40000000-0x47ffffff]
>>>>>>>> [    2.612657] pcie_iproc_bcma bcma0:8: link: UP
>>>>>>>> [    2.617241] PCI: bus0: Fast back to back transfers disabled
>>>>>>>> [    2.622845] pci 0001:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>>>>>>>> [    2.631297] PCI: bus1: Fast back to back transfers disabled
>>>>>>>> [    2.636887] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>>>>>>>> [    2.645035] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
>>>>>>>> (see 4.4.txt for the backtrace)
>>>>>>>>
>>>>>>>> At first I was hoping that we simply needed to re-add the removed
>>>>>>>> workaround. I tried it, but it turned out that one abort is immediately
>>>>>>>> followed by another:
>>>>>>>> [    2.936895] pci 0001:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>>>>>>>> [    2.945053] External imprecise Data abort at addr=0x0, fsr=0x1406 ignored.
>>>>>>>> [    2.951966] Unhandled fault: imprecise external abort (0x1406) at 0x00000000
>>>>>>>>
>>>>>>>> So it seems that commits bbeb920 and 9254970 broke something in PCI
>>>>>>>> host initialization (or maybe just exposed another bug?). Instead of
>>>>>>>> getting an abort once and late, we are now getting many of them, and a
>>>>>>>> bit earlier.
>>>>>>
>>>>>>
>>>>>>
>>>>>> These commits made the kernel "listen" to such errors earlier, so that
>>>>>> they will be shown at the time they occur and not sometime later.
>>>>>
>>>>>
>>>>>
>>>>> So AFAIU with kernel 4.3:
>>>>> 1) Aborts were masked (silent) until "Freeing unused kernel memory"
>>>>> 2) There was one (silent) abort caused by a bootloader
>>>>> 3) There were likely multiple aborts (silent) during early PCI init
>>>>> 4) After unmasking we got only a single abort reported and we were
>>>>> ignoring it
>>>>>
>>>>> With kernel 4.4:
>>>>> 1) All aborts are reported immediately
>>>>> 2) Abort caused by a bootloader gets ignored by ARM code:
>>>>> "Hit pending asynchronous external abort (FSR=0x00001c06) during first unmask"
>>>>> thanks to 9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
>>>>> 3) There are still multiple aborts during PCI init (reported immediately now)
>>>>> 4) To work as before (in 4.3) we should ignore all aborts, not only the
>>>>> 1st one
>>>>>
>>>>> Of course the proposed solution is an ugly workaround; we should have
>>>>> no aborts reported in the first place.
>>>>>
>>>> A master abort on the PCI bus during probe of the PCI config space
>>>> (device enumeration) is expected. Most host bridges ignore those errors
>>>> and just return 0 for the read transaction.
>>>>
>>>> Some bridges forward the error onto the AXI/AMBA bus and thus cause
>>>> imprecise external aborts on the ARM core.
>>>
>>>
>>>
>>> Yes, I suspect this is the case for these imprecise external aborts
>>> triggered by the iProc PCIe.
>>>
>>>> If your host bridge doesn't
>>>> have a way to disable error forwarding during PCI bus probe you need to
>>>> install an abort handler. Most implementations based on the designware
>>>> PCIe core do this already.
>>>
>>>
>>>
>>> Is this as simple as registering an abort handler to the hook in the
>>> iProc PCIe driver and, based on the fsr (0x1406 in our case), simply
>>> ignoring the abort by returning zero from the abort handler?
>>
>>
>> This is what I did in OpenWrt an hour ago and it seems to be working:
>>
>> http://git.openwrt.org/?p=openwrt.git;a=commitdiff;h=f823c5da71f0dd859facc5ece575a48c28279d35
>>
>
> It looks good to me except that I think you should register the hook in
> "iproc_pcie_setup" so both the BCMA and platform based iProc PCIe drivers
> can use it.
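
For reference, here is a minimal user-space sketch of how the two FSR values
seen in this thread (0x1406 and 0x1c06) decode, and of a handler that ignores
the abort by returning zero. It assumes the ARMv7 short-descriptor FSR layout
(FS[3:0] in bits 3:0, FS[4] in bit 10, WnR in bit 11). The names fsr_fs() and
sketch_abort_handler() are made up for illustration; this is not the code from
the OpenWrt commit or from the iProc driver.

```c
#include <stdint.h>

/* Assumed ARMv7 short-descriptor FSR bit layout. */
#define FSR_FS3_0  0x0000000fu   /* FS[3:0]: low four status bits */
#define FSR_FS4    (1u << 10)    /* FS[4]: fifth status bit */
#define FSR_WNR    (1u << 11)    /* write-not-read flag */

/* FS == 0b10110 (16 + 6): asynchronous ("imprecise") external abort. */
#define FS_ASYNC_EXTERNAL_ABORT 22u

/* Combine FS[4] and FS[3:0] into a single 5-bit fault status index. */
static unsigned int fsr_fs(uint32_t fsr)
{
	return (fsr & FSR_FS3_0) | ((fsr & FSR_FS4) >> 6);
}

/*
 * Hypothetical handler in the style discussed above:
 * returning 0 means "handled, ignore it", non-zero means "not handled".
 */
static int sketch_abort_handler(unsigned long addr, uint32_t fsr)
{
	(void)addr; /* the reported address is 0x0 and carries no information */

	return fsr_fs(fsr) == FS_ASYNC_EXTERNAL_ABORT ? 0 : 1;
}
```

Both 0x1406 and 0x1c06 decode to the same fault status (22); they differ only
in bit 11 (WnR). Matching on the decoded status rather than on the raw FSR
value would therefore cover both, if that is the behaviour wanted.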

Should I add some new field to struct iproc_pcie, like "bool
hook_abort_handler"?
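
Roughly what I have in mind, as a user-space sketch only. The struct below is
a hypothetical one-field stand-in for struct iproc_pcie, sketch_hook_fault()
is a made-up stand-in for the real fault hook registration, and
sketch_iproc_pcie_setup() only mimics the shape of iproc_pcie_setup():

```c
#include <stdbool.h>
#include <stddef.h>

typedef int (*sketch_abort_fn)(unsigned long addr, unsigned int fsr);

/* Stand-in for the kernel's fault handler slot; purely illustrative. */
static sketch_abort_fn sketch_installed_handler;

static void sketch_hook_fault(sketch_abort_fn fn)
{
	sketch_installed_handler = fn;
}

/* Hypothetical handler: ignore the abort by returning zero. */
static int sketch_ignore_abort(unsigned long addr, unsigned int fsr)
{
	(void)addr;
	(void)fsr;
	return 0;
}

/* Hypothetical subset of struct iproc_pcie with the proposed field. */
struct sketch_iproc_pcie {
	bool hook_abort_handler; /* set by the bus glue that needs the hook */
};

/* Common setup path: install the handler only when the caller asked for it. */
static int sketch_iproc_pcie_setup(struct sketch_iproc_pcie *pcie)
{
	if (pcie->hook_abort_handler)
		sketch_hook_fault(sketch_ignore_abort);

	return 0;
}
```

With this shape each caller (BCMA or platform) decides whether to set the flag
before calling the common setup function, and the hook itself lives in one
place.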

-- 
Rafał



More information about the linux-arm-kernel mailing list