Unable to read firmware registers on crash?

Mon Feb 2 22:31:59 PST 2015

On 2 February 2015 at 18:03, Ben Greear <greearb at candelatech.com> wrote:
> On 02/02/2015 04:11 AM, Michal Kazior wrote:
>> On 1 February 2015 at 18:46, Ben Greear <greearb at candelatech.com> wrote:
>>> I am trying to debug a case where firmware occasionally crashes and
>>> the driver cannot read any crash dump to debug the problem further.
>>>
>>> Any idea what might be the problem and how I might could read info
>>> from the firmware (or hack firmware to deliver the crash info by some
>>> other means...maybe through a previously reserved piece of memory on
>>> the host??)
>>>
>>> [147334.397148] ath10k: firmware crashed! (uuid
>>> ff405224-b2a4-493d-b619-19ad8152d190)
>>> [147334.404808] ath10k: hardware name qca988x hw2.0 version 0x4100016c
>>> [147334.411111] ath10k: firmware version: 10.1.467-ct-community-full-013
>>> [147334.429603] ath10k: failed to read diag value at 0x1300804: -16
>>> [147334.435647] ath10k: failed to read FW dump area address: -16
>>
>> I see this rarely. Mostly when the device goes bonkers at which point
>> warm reset doesn't work anymore and I'm forced to either risk cold
>> reset causing platform lock up or re-inject the card in the express
>> card slot.
>>
>> I haven't played with this so I don't know if DMA is still possible
>> (it might not). MMIO should be operational and there should be some
>> scratch registers chilling around... ;-)
>
> To normally read the crash dump, we do this over a CE pipe?

Correct. CE7, the so-called diagnostic window, is used.

> So, if IRQ handlers are thoroughly busted on the target, then
> that could be the reason why we cannot read the crash dump?

Incorrect. From what I understand, the diagnostic window CE has
built-in logic for fetching chunks of target RAM on each request.
This means that even if the target program has masked IRQs and is
spinning in a while (1) {} loop, the host is still able to access
its memory.

If you're unable to read the dump, it means that the CE has crashed
(whatever that _actually_ means) and thus the built-in logic for the
diagnostic window goes down as well. I never got around to recovering
from a CE crash alone without resorting to the risky target cold reset.

> If I were to use MMIO, what sort of things could I get
> access to?  Just registers you think?  I might could tweak
> the firmware assert routine to scribble the crash register
> contents into a specific place, perhaps a bit at a time
> so that the host could read it if normal crash dump read
> fails?

I was thinking that instead of ending the assert routine with a
while (1) {} loop, you could put in a busy loop that does a simple
handshake through a few software-writable MMIO registers (e.g. scratch
registers) that aren't used by the MAC-related hw. That would give you
a fallback way to "stream" the crash dump data: the firmware puts one
data word at a time in a register and the host ACKs each one.
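
A rough sketch of such a word-at-a-time handshake, simulated in plain C
so it can be run on a host. Everything here is hypothetical: the three
"registers" (data/seq/ack), the struct and function names, and the
sequence-number scheme are assumptions made for illustration, not
ath10k or firmware API. On real hardware the fields would be scratch
registers accessed over MMIO, with whatever barriers the bus needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scratch registers, simulated as plain memory. */
struct scratch_regs {
	volatile uint32_t data; /* firmware: current dump word */
	volatile uint32_t seq;  /* firmware: bumped once per published word */
	volatile uint32_t ack;  /* host: seq value of the last word consumed */
};

/* Firmware-side cursor over the crash dump (hypothetical). */
struct fw_state {
	const uint32_t *dump;
	size_t len;
	size_t next;
};

/*
 * One iteration of the firmware busy loop that would replace while (1) {}.
 * Returns 1 while work remains, 0 once every word has been ACKed.
 */
static int fw_poll(struct fw_state *fw, struct scratch_regs *r)
{
	if (r->ack != r->seq)
		return 1; /* host has not consumed the current word yet */
	if (fw->next == fw->len)
		return 0; /* everything sent and ACKed */
	r->data = fw->dump[fw->next++];
	r->seq = r->seq + 1; /* publish only after data is in place */
	return 1;
}

/* One host-side poll. Returns 1 if a new word was stored in *out. */
static int host_poll(struct scratch_regs *r, uint32_t *out)
{
	uint32_t seq = r->seq;

	if (seq == r->ack)
		return 0; /* nothing new */
	*out = r->data;
	r->ack = seq; /* ACK so the firmware can publish the next word */
	return 1;
}

/* Drive both sides in lockstep; returns the number of words received. */
size_t stream_dump(const uint32_t *dump, size_t len,
		   uint32_t *out, size_t out_cap)
{
	struct scratch_regs regs = { 0, 0, 0 };
	struct fw_state fw = { dump, len, 0 };
	size_t got = 0;
	uint32_t word;

	assert(out_cap >= len);
	while (fw_poll(&fw, &regs)) {
		if (host_poll(&regs, &word))
			out[got++] = word;
	}
	return got;
}
```

Only equality of seq and ack is compared, so wraparound of the counter
is harmless. In a real setup the two poll functions would run on
different CPUs, and the host would want a timeout in case the firmware
never reaches the loop.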

Michał