ath10k driver crashes whenever firmware crashes on ARM SoC

Sun Feb 9 03:00:45 EST 2014

On Wed, Jan 29, 2014 at 9:41 PM, Avery Pennarun <apenwarr at gmail.com> wrote:
> Still chasing around some people to get a PCIe bus analyzer set up.

Okay, I finally managed to get enough parts put together to look at
the PCIe bus.  To make things a little more clear, I added a macro
that does essentially:

   pci_write_config_dword(0, 0x80000000 | __LINE__);
   mdelay(1);
   pci_write_config_dword(0, __LINE__);

...at various points in the code.  This way I can see precisely what
was the most recent PCIe transaction before the crash.
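
A user-space sketch of the marker scheme described above, for
illustration only.  marker_write(), trace_mark_line(), TRACE_MARK() and
the decode helpers are hypothetical names, and marker_write() merely
records values in place of the real config-space write; the actual
macro used in the driver is not shown in this mail.

```c
#include <assert.h>
#include <stdint.h>

#define MARK_FIRST 0x80000000u  /* top bit: set on the first write of a pair */

/* Hypothetical stand-in for the config-space write.  It only records the
 * values so that this sketch runs in user space. */
static uint32_t marker_log[16];
static unsigned marker_count;

static void marker_write(uint32_t val)
{
    marker_log[marker_count % 16] = val;
    marker_count++;
}

/* Emit the marker pair for one source line. */
static void trace_mark_line(unsigned line)
{
    assert(line < MARK_FIRST);        /* line number must fit in 31 bits */
    marker_write(MARK_FIRST | line);  /* first write: top bit set */
    /* the driver version waits here with mdelay(1) */
    marker_write(line);               /* second write: top bit clear */
}

#define TRACE_MARK() trace_mark_line(__LINE__)

/* Decode a value seen on the analyzer. */
static int marker_is_first(uint32_t val)
{
    return (val & MARK_FIRST) != 0;
}

static unsigned marker_line(uint32_t val)
{
    return val & ~MARK_FIRST;
}
```

With this encoding, a capture that ends on a value with the top bit
set and no matching second write means the failure happened inside the
delay at that source line.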

I'm not super familiar with PCIe, but what I think I'm seeing is:

- the firmware does not need to be loaded yet; sometimes I can crash
it just by doing a cold reset right at driver load time.  So the good
news is, the firmware code is not related.

- the crash is always in ath10k_pci_device_reset

- there are definitely some missing memory barriers in here; in a few
cases you can clearly see a write completing before the read that
precedes it in the code.  Looking at the definitions of iowrite32 and
ioread32, and of rmb() and wmb(), we can see that rmb() and wmb() are
not sufficient (at least on ARM) when you care about the ordering
between reads and writes.  However, I don't think this actually causes
the problem.

- the crash happens after writing the 1 to SBC_GLOBAL_RESET_ADDRESS.
The write gets an ACK from the device, so there are no interrupted
PCIe transactions.

- after writing that 1, the PCI bus is fine for ~272 usec.  I can see
the first pci_write_config_dword from my macro above, but the crash
happens during the mdelay(1), and the second pci_write_config_dword
never goes through.

- ~272 usec after the write, I see TS1 packets getting transmitted at
maximum speed in both directions.  Does this mean the connection is
retraining?

- 50 usec after the first TS1 packet (a surprisingly precise number),
I see an EIOS packet sent in the downstream direction.  After that,
they appear every 25 usec.  However, they *all* show invalid parity
bits according to the PCI analyzer.
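
The timing in the last three items can be laid out as a small worked
calculation.  This is only a sketch: the constants are the approximate
figures quoted above, all offsets are taken relative to the write of 1
to SBC_GLOBAL_RESET_ADDRESS, and the names are hypothetical.  It also
assumes the mdelay(1) starts at roughly the same time as that write.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate offsets in microseconds, relative to the reset write. */
enum {
    TS1_START_US   = 272,   /* ~272 usec: first TS1 seen */
    EIOS_DELAY_US  = 50,    /* first EIOS, measured from the first TS1 */
    EIOS_PERIOD_US = 25,    /* spacing of the EIOS that follow */
    MDELAY_US      = 1000   /* mdelay(1) in the marker macro */
};

/* Offset of the n-th EIOS after the reset write (n = 0 is the first). */
static uint32_t eios_offset_us(uint32_t n)
{
    return TS1_START_US + EIOS_DELAY_US + n * EIOS_PERIOD_US;
}

/* True if an event at offset t_us falls inside the 1 ms delay window. */
static int inside_mdelay(uint32_t t_us)
{
    return t_us < MDELAY_US;
}
```

Under these assumptions the first EIOS lands at about 322 usec and the
second at about 347 usec, both well inside the 1 ms delay, which is
consistent with the second marker write never reaching the bus.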

Does this ring a bell for anyone?  I think I can also export the
traces as csv in case someone wants to look at them.

Thanks!