X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

Jon Masters jcm at redhat.com
Sat Sep 5 13:13:12 PDT 2015


On 08/11/2015 03:28 PM, Bjorn Helgaas wrote:
> On Mon, Aug 10, 2015 at 2:07 PM, Duc Dang <dhdang at apm.com> wrote:
>> On Mon, Aug 10, 2015 at 10:42 AM, Bjorn Helgaas <bhelgaas at google.com> wrote:
>>> On Mon, Aug 10, 2015 at 12:16 PM, Duc Dang <dhdang at apm.com> wrote:
>>>> On Monday, August 10, 2015, Bjorn Helgaas <bhelgaas at google.com> wrote:
>>>>>
>>>>> On Fri, Jul 31, 2015 at 12:00 PM, Duc Dang <dhdang at apm.com> wrote:
>>>>>> On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <bhelgaas at google.com> wrote:
>>>>>>> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
>>>>>>>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
>>>>>>>
>>>>>>>>> Do you have another PCIe card to try on the same reboot test on this
>>>>>>>>> board?
>>>>>>>>
>>>>>>>> I've seen this on at least two Mellanox cards.  I'm running similar
>>>>>>>> tests on a different type of card now.
>>>>>>>
>>>>>>> FWIW, reboot tests on two machines with Mellanox cards failed, while
>>>>>>> the same test on a machine with a different proprietary card succeeded.
>>>>>>
>>>>>> Thanks, Bjorn.
>>>>>>
>>>>>> I don't have the same Mellanox card as yours, but I will also run
>>>>>> a similar reboot test to see if I hit the same issue with my card.
>>>>>
>>>>> Any more hints on this?  Nothing has changed on my end, so of course
>>>>> I'm still seeing this, always on machines with Mellanox, and never on
>>>>> other machines.  Could this be a hardware issue like a signal
>>>>> integrity or margin issue?  I don't know where to go from here because
>>>>> I'm not a hardware person, and I don't know anything to do in
>>>>> software.
>>>>
>>>>
>>>> Hi Bjorn,
>>>>
>>>> I tried to run similar reboot tests on 2 different Mellanox cards (Connect-X
>>>> family, one card has 2 10G interfaces, the other one has 1 port that
>>>> supports InfiniBand) with U-Boot 1.15.12 and linux 4.2-rc5 and I did not see
>>>> the crash that you encountered.
>>>>
>>>> Did you check if your Mellanox cards have the latest firmware? I did see some
>>>> link issues on my Mellanox cards with their old firmware before.
>>>
>>> Good idea; I'll check that, too.  Also, I just learned that these
>>> cards are installed with an extender card because of some space issues,
>>> so we're going to test again without the extender.
>>
>> Hi Bjorn,
>>
>> Are other cards that passed your test installed directly to the
>> on-board PCIe slot?
>> If yes, then this is a good data point and it will be useful to test
>> the case where your Mellanox cards are directly installed into the
>> on-board PCIe slot.
> 
> The cards that passed the test were installed directly, with no
> extender.  We removed the extender from one of the machines with the
> Mellanox card and have not seen this issue since then.  I think it's
> very likely that the problem is related to using the extender.

If you're trying to use Mellanox cards in (for example) an APM
Mustang-like system with a PCIe extender card (for example, a 90-degree
angle adjustment for a low-profile server case), you might want to ping
me offline. I have procured a number of these over the past couple of
years for my home lab and have found one that works (almost) reliably on
that particular hardware platform and does 10G.

Jon.





More information about the linux-arm-kernel mailing list