Linux freezes after a time while running

Tue Nov 1 20:31:14 PDT 2016

On 31 October 2016 at 17:48, Conrad Kostecki <ck+ath10k at bl4ckb0x.de> wrote:
> Hello Michał,
>
> Am 31.10.2016 11:12:03, "Michal Kazior" <michal.kazior at tieto.com> schrieb:
>
>>  You could try loading ath10k_pci with the reset_mode=1 parameter.
>>
>> Cold reset is known to cause some problems after certain firmware
>> crashes and I've personally experienced system freezes on x86 (MIPS
>> tends to spit "data bus error" and doesn't lock up).
>
> Thank you very much for your answer. I've now set reset_mode=1,
> which now seems to be active, as I can see in dmesg:
> [    8.471659] ath10k_pci 0000:08:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 1
> [    8.587267] ath10k_pci 0000:09:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 1
>
> After starting HostAPd, I powered up my Squeezebox Radio to connect via
> WiFi.
> After just a few minutes, it crashed, as expected, but it did not restart
> the whole server.
> Is this due to reset_mode=1? I was now able to capture a lot of
> information from dmesg.
> You can clearly see that the firmware crashed. The HostAPd process is
> still running, but the WiFi can no longer be detected.
>
> As it's quite long, I've put it on pastebin: http://pastebin.com/83WZktp6
>
> You can see that at mark 250 WiFi1 (2.4GHz) comes up and at mark 356
> WiFi2 (5GHz) comes up.
> At mark 691, ath10k_pci crashed and WiFi stopped working. Normally, at
> this point the whole server would reboot.
>
> I've also now tried the newest firmware, 10.2.4.70.58, with no luck.
>
> Any ideas?

The driver is unable to retrieve the register dump and there are a lot
of failures happening. This doesn't look like a firmware crash per se.
It looks more like a device failure caused by the host refusing pcie
access or something similar (much like when the x86 iommu refuses dma
access on a use-after-free) that ends up being reported as a firmware
crash: the target cpu catches the fault and runs the handler, which is
treated as an uncaught assert and reported to the host the same way an
assert would be, but with a different register dump.

I suspect the pcie link gets broken in one direction, because attempting
a cold reset crashed the host even harder.

Looks like the device went into a very confused state due to pcie link
failure starting from:

  [  691.609836] pcieport 0000:00:02.0: AER: Multiple Uncorrected (Non-Fatal) error received: id=0800

I'm not really familiar with these. Perhaps there's a pcie bridge
problem on your host platform or maybe an electrical issue (e.g.
insufficient power supply to handle short bursts?).
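
For anyone digging through a similar capture, here is a minimal sketch
for pulling the pcieport AER reports out of saved dmesg text, so the
first link error can be lined up against the later ath10k failures. It
is not part of ath10k or any kernel tooling. The function name
find_aer_events is hypothetical, and the pattern assumes each log entry
sits on a single line in the format shown above.

```python
import re

# Assumed line format (taken from the log line quoted in this thread):
# [  691.609836] pcieport 0000:00:02.0: AER: <message>
AER_LINE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+pcieport\s+(?P<port>[0-9a-f:.]+):\s+AER:\s+(?P<msg>.+)$",
    re.IGNORECASE,
)


def find_aer_events(dmesg_text):
    """Return (timestamp, port, message) tuples for pcieport AER lines."""
    events = []
    for line in dmesg_text.splitlines():
        match = AER_LINE.search(line)
        if match:
            events.append(
                (
                    float(match.group("ts")),
                    match.group("port"),
                    match.group("msg").strip(),
                )
            )
    return events


if __name__ == "__main__":
    sample = (
        "[  691.609836] pcieport 0000:00:02.0: AER: Multiple Uncorrected "
        "(Non-Fatal) error received: id=0800\n"
    )
    for ts, port, msg in find_aer_events(sample):
        print(f"{ts:.6f} {port} {msg}")
```

Sorting the returned tuples by timestamp and comparing the earliest one
with the first ath10k_pci error in the same capture shows whether the
link error came before the driver started failing.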

Michal