ath10k driver crashes whenever firmware crashes on ARM SoC

Tue Mar 11 03:52:54 EDT 2014

... it's not a complete loss!

This to me says "we need a hook so the driver can call the host's
'reset the bus' routine".

We also kinda need it for ath9k/ath5k (if it's not there already) so
AHB-attached devices can be reset by actually poking an SoC reset register.
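
One possible shape for such a hook, as a minimal userspace sketch rather
than real ath10k or kernel code. Every name here (wifi_dev,
wifi_set_host_reset, wifi_recover, the demo_* stubs) is hypothetical and
exists only for illustration: the platform registers a callback, and the
driver calls it only when its own device reset fails.

```c
#include <stddef.h>

/* All names below are hypothetical; none exist in ath10k or the kernel. */

/* Callback supplied by platform (host bridge) code. Returns 0 on success. */
typedef int (*host_bus_reset_fn)(void *host_ctx);

struct wifi_dev {
    host_bus_reset_fn host_reset; /* NULL when the platform offers no hook */
    void *host_ctx;               /* opaque platform data handed back to the hook */
};

/* Platform code registers its bus-reset routine with the device. */
static void wifi_set_host_reset(struct wifi_dev *dev,
                                host_bus_reset_fn fn, void *ctx)
{
    dev->host_reset = fn;
    dev->host_ctx = ctx;
}

/*
 * Driver-side recovery: try the device's own reset first, and ask the
 * host to reset the bus only if that fails.
 * Returns 0 when either step succeeds, -1 otherwise.
 */
static int wifi_recover(struct wifi_dev *dev,
                        int (*device_reset)(struct wifi_dev *))
{
    if (device_reset(dev) == 0)
        return 0;
    if (dev->host_reset == NULL)
        return -1; /* no hook registered, nothing more to try */
    return dev->host_reset(dev->host_ctx);
}

/* Demo stub: a device reset that always fails. */
static int demo_device_reset_fails(struct wifi_dev *dev)
{
    (void)dev;
    return -1;
}

/* Demo stub: a host reset that counts its calls and always succeeds. */
static int demo_host_reset(void *ctx)
{
    int *calls = ctx;
    (*calls)++;
    return 0;
}
```

With no hook registered, wifi_recover() just reports failure; once the
platform registers one, a failed device reset falls through to it.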

-a

On 11 March 2014 00:40, Avery Pennarun <apenwarr at gmail.com> wrote:
> On Tue, Mar 11, 2014 at 2:33 AM, Kalle Valo <kvalo at qca.qualcomm.com> wrote:
>> Avery Pennarun <apenwarr at gmail.com> writes:
>>> On Wed, Jan 29, 2014 at 9:41 PM, Avery Pennarun <apenwarr at gmail.com> wrote:
>>>> Still chasing around some people to get a PCIe bus analyzer set up.
>>>
>>> Okay, I finally managed to get enough parts put together to look at
>>> the PCIe bus.  To make things a little more clear, I added a macro
>>> that does essentially:
>>>
>>>    pci_write_config_dword(0, 0x80000000 | __LINE__)
>>>    mdelay(1);
>>>    pci_write_config_dword(0, __LINE__)
>>>
>>> ...at various points in the code.  This way I can see precisely what
>>> was the most recent PCIe transaction before the crash.
>>>
>>> I'm not super familiar with PCIe, but what I think I'm seeing is:
>>>
>>> - the firmware does not need to be loaded yet; sometimes I can crash
>>> it just by doing a cold reset right at driver load time.  So the good
>>> news is, the firmware code is not related.
>>>
>>> - the crash is always in ath10k_pci_device_reset
>>
>> [...]
>>
>>> Does this ring a bell for anyone?  I think I can also export the
>>> traces as csv in case someone wants to look at them.
>>
>> I showed your analysis to an HW engineer and the response I got was
>> "don't do that" (= don't use the cold reset). As you know, we now have a
>> workaround using the warm reset:
>>
>> 00f5482bcd94 ath10k: suspend hardware before reset
>> 9042e17df834 ath10k: refactor suspend/resume functions
>> fc36e3ffcdd0 ath10k: fix device initialization routine
>>
>> Have you tested these? Did they help at all?
>
> Yes, I've tested them and they help, mainly by doing the cold reset
> less often.  However, when the firmware hard crashes in certain ways
> (for example, using my original test case), it looks like warm reset
> can't fix that.  The driver then still must fall back to cold reset
> and, some fairly large percentage of the time (1/3rd?), crashes the
> bus.
>
> We do have a separate reset line controlled by a GPIO.  Using that
> crashes the SoC's PCIe host implementation (whee!).  But I got help
> from the SoC manufacturer and was able to get some instructions for
> resetting their PCIe host controller.  When I do all the magic
> incantations in the right order, the system can recover, albeit with a
> fully reset ath10k chip.  This workaround is unfortunately specific to
> the host device platform so it won't do you much good.
>
> Of course, a good way to avoid the problem is "don't crash the
> firmware then," but that's not as robust as I'd like.  This box is
> doing quite a few things, so rebooting to fix a problem on one of the
> wireless cards is pretty expensive.
>
> Nevertheless, the warm reset changes really do reduce the frequency of
> this a lot, to the point where my workaround is almost never needed.
> Thanks for that!
>
> Have fun,
>
> Avery
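
The line-number markers described in the quoted message can be sketched
as follows. This is a userspace illustration under assumptions, not the
macro actually used: mark_write() and mark_delay_ms() are hypothetical
stand-ins for the config-space write and mdelay(1), and BUS_MARK() and
mark_decode() are illustrative names. The idea is that the high bit tags
the first write of each pair, so each marker is easy to spot in a bus
trace and decodes back to a source line.

```c
#include <stdint.h>

#define MARK_FLAG    0x80000000u /* high bit tags the first write of a pair */
#define MARK_LOG_MAX 16

/* Hypothetical stand-in for the bus: records what would be written. */
static uint32_t mark_log[MARK_LOG_MAX];
static unsigned mark_count;

/* Stand-in for a config-space write; a real driver would hit the bus here. */
static void mark_write(uint32_t val)
{
    if (mark_count < MARK_LOG_MAX)
        mark_log[mark_count++] = val;
}

/* Stand-in for mdelay(); does nothing in this sketch. */
static void mark_delay_ms(unsigned ms)
{
    (void)ms;
}

/*
 * Emit a pair of marker writes carrying the caller's line number:
 * first with the high bit set, then, after a short delay, without it.
 * __LINE__ expands to the line of the BUS_MARK() invocation.
 */
#define BUS_MARK() do {                             \
        mark_write(MARK_FLAG | (uint32_t)__LINE__); \
        mark_delay_ms(1);                           \
        mark_write((uint32_t)__LINE__);             \
    } while (0)

/* Decode a captured value: returns the line number, sets *is_start. */
static uint32_t mark_decode(uint32_t val, int *is_start)
{
    *is_start = (val & MARK_FLAG) != 0;
    return val & ~MARK_FLAG;
}
```

When reading a capture, the last marker value seen before the bus goes
quiet decodes to the source line that ran most recently.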