4.4 BCM5301X ARM regression "External imprecise Data abort"

Rafał Miłecki zajec5 at gmail.com
Fri Apr 8 15:05:22 PDT 2016


On 9 April 2016 at 00:02, Ray Jui <ray.jui at broadcom.com> wrote:
> On 4/8/2016 1:43 AM, Lucas Stach wrote:
>>
>> On Friday, 08.04.2016 at 08:45 +0200, Rafał Miłecki wrote:
>>>
>>> On 4 April 2016 at 23:23, Hauke Mehrtens <hauke at hauke-m.de> wrote:
>>>>
>>>> On 04/04/2016 11:08 PM, Scott Branden wrote:
>>>>>
>>>>> On 16-04-03 11:13 PM, Rafał Miłecki wrote:
>>>>>>
>>>>>> I got regression reports from Netgear R8000 (BCM4709A0) users and did
>>>>>> some testing & regression tracking with Aditya.
>>>>>>
>>>>>> It happens that Linux 4.4 doesn't boot due to the following commits:
>>>>>> bbeb920 ("ARM: 8422/1: enable imprecise aborts during early kernel
>>>>>> startup")
>>>>>> 9254970 ("ARM: 8447/1: catch pending imprecise abort on unmask")
>>>>>> 937b123 ("ARM: BCM5301X: remove workaround imprecise abort fault
>>>>>> handler")
>>>>>>
>>>>>> In kernel 4.3 we had that abort workaround, which resulted in:
>>>>>> [    5.007128] Freeing unused kernel memory: 212K (c0435000 -
>>>>>> c046a000)
>>>>>> [    5.694632] init: Console is alive
>>>>>> [    5.698169] init: - watchdog -
>>>>>> [    5.701470] External imprecise Data abort at addr=0x0, fsr=0x1406
>>>>>> ignored.
>>>>>> As you can see, this abort happened soon after freeing unused
>>>>>> memory and ignoring it *once* did the trick. It never appeared
>>>>>> again.
>>>>
>>>>
>>>> I assume only one of these can be pending at a time, and while aborts
>>>> are masked any further one is dropped or overwrites it. So it could be
>>>> that more than one is raised here.
>>>>
>>>>>> With 4.4 similar (or the same?) abort happens earlier (during PCI host
>>>>>> driver init) and doesn't get ignored:
>>>>>> [    2.478461] pci 0000:00:00.0: PCI bridge to [bus 01]
>>>>>> [    2.483451] pci 0000:00:00.0:   bridge window [mem
>>>>>> 0x08000000-0x085fffff]
>>>>>> [    2.599449] pcie_iproc_bcma bcma0:8: PCI host bridge to bus 0001:00
>>>>>> [    2.605744] pci_bus 0001:00: root bus resource [mem
>>>>>> 0x40000000-0x47ffffff]
>>>>>> [    2.612657] pcie_iproc_bcma bcma0:8: link: UP
>>>>>> [    2.617241] PCI: bus0: Fast back to back transfers disabled
>>>>>> [    2.622845] pci 0001:00:00.0: bridge configuration invalid ([bus
>>>>>> 00-00]), reconfiguring
>>>>>> [    2.631297] PCI: bus1: Fast back to back transfers disabled
>>>>>> [    2.636887] pci 0001:01:00.0: bridge configuration invalid ([bus
>>>>>> 00-00]), reconfiguring
>>>>>> [    2.645035] Unhandled fault: imprecise external abort (0x1406) at
>>>>>> 0x00000000
>>>>>> (see 4.4.txt for the backtrace)
>>>>>>
>>>>>> At first I was hoping that we simply need to re-add the removed
>>>>>> workaround. I tried it but it appeared that one abort is immediately
>>>>>> followed by another:
>>>>>> [    2.936895] pci 0001:01:00.0: bridge configuration invalid ([bus
>>>>>> 00-00]), reconfiguring
>>>>>> [    2.945053] External imprecise Data abort at addr=0x0, fsr=0x1406
>>>>>> ignored.
>>>>>> [    2.951966] Unhandled fault: imprecise external abort (0x1406) at
>>>>>> 0x00000000
>>>>>>
>>>>>> So it seems that commits bbeb920 and 9254970 broke something in PCI
>>>>>> host initialization (or maybe just exposed another bug?). Instead of
>>>>>> getting an abort once and late, we are now getting many of them and
>>>>>> a bit earlier.
>>>>
>>>>
>>>> These commits made the kernel "listen" to such errors earlier, so that
>>>> they are shown at the time they occur and not sometime later.
>>>
>>>
>>> So AFAIU with kernel 4.3:
>>> 1) Aborts were masked (silent) until "Freeing unused kernel memory"
>>> 2) There was one (silent) abort caused by a bootloader
>>> 3) There were likely multiple aborts (silent) during early PCI init
>>> 4) After unmasking we got only a single abort reported and we were
>>> ignoring it
>>>
>>> With kernel 4.4:
>>> 1) All aborts are reported immediately
>>> 2) Abort caused by a bootloader gets ignored by ARM code:
>>> "Hit pending asynchronous external abort (FSR=0x00001c06) during first
>>> unmask"
>>> thanks to 9254970 ("ARM: 8447/1: catch pending imprecise abort on
>>> unmask")
>>> 3) There are still multiple aborts during PCI init (reported immediately
>>> now)
>>> 4) To work as before (in 4.3) we should ignore all aborts, not only the
>>> 1st one
>>>
>>> Of course the proposed solution is an ugly workaround; we should have
>>> no aborts reported in the first place.
>>>
>> A master abort on the PCI bus during probe of the PCI config space
>> (device enumeration) is expected. Most host bridges ignore those errors
>> and just return 0 for the read transaction.
>>
>> Some bridges forward the error onto the AXI/AMBA bus and thus cause
>> imprecise external aborts on the ARM core.
>
>
> Yes, I suspect this is the case for these imprecise external aborts
> triggered by the iProc PCIe.
>
>> If your host bridge doesn't
>> have a way to disable error forwarding during PCI bus probe you need to
>> install an abort handler. Most implementations based on the designware
>> PCIe core do this already.
>
>
> Is this as simple as registering an abort handler to the hook in the iProc
> PCIe driver and, based on the fsr (0x1406 in our case), simply ignoring the
> abort by returning zero from the abort handler?

This is what I did in OpenWrt an hour ago and it seems to be working:
http://git.openwrt.org/?p=openwrt.git;a=commitdiff;h=f823c5da71f0dd859facc5ece575a48c28279d35
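
For readers following the thread, here is a minimal user-space sketch of the
decision such a handler could make. It is not the code from the commit linked
above. It assumes the ARMv7 short-descriptor DFSR layout (FS[3:0] in bits 3:0,
FS[4] in bit 10, WnR in bit 11, ExT in bit 12), and the names
fsr_fault_status(), fsr_is_write() and example_abort_handler() are
hypothetical. In the kernel a function like this would be registered through
hook_fault_code() for fault status 16 + 6, with the usual extra
struct pt_regs * argument, which is omitted here so the sketch builds outside
the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* DFSR bits, assuming the ARMv7 short-descriptor format. */
#define FSR_FS_LOW_MASK 0xfu        /* FS[3:0] */
#define FSR_FS4         (1u << 10)  /* FS[4] */
#define FSR_WNR         (1u << 11)  /* set when the aborted access was a write */
#define FSR_EXT         (1u << 12)  /* external abort qualifier */

/* Fault status 0b10110: asynchronous (imprecise) external abort. */
#define FS_ASYNC_EXTERNAL_ABORT 0x16u

/* Assemble the 5-bit fault status from its two pieces in the FSR. */
static unsigned int fsr_fault_status(uint32_t fsr)
{
	return (fsr & FSR_FS_LOW_MASK) | ((fsr & FSR_FS4) ? 0x10u : 0u);
}

static bool fsr_is_write(uint32_t fsr)
{
	return (fsr & FSR_WNR) != 0;
}

/*
 * Hypothetical handler following the kernel convention: return 0 to
 * swallow the abort, non-zero to let it be reported as unhandled.
 * Only imprecise external aborts on reads are ignored, which is the
 * pattern seen during PCI config space probing in the logs (0x1406).
 */
static int example_abort_handler(unsigned long addr, uint32_t fsr)
{
	(void)addr; /* not meaningful for imprecise aborts */

	if (fsr_fault_status(fsr) == FS_ASYNC_EXTERNAL_ABORT &&
	    (fsr & FSR_EXT) && !fsr_is_write(fsr))
		return 0;

	return 1;
}
```

With the FSR values quoted in this thread, 0x1406 decodes to an imprecise
external abort on a read and would be ignored, while 0x1c06 (the pending
abort left by the bootloader) has the write bit set and would still be
reported by this sketch. The obvious downside is that any other read that
triggers the same FSR value would be silenced as well.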

-- 
Rafał


