Anyone seeing tx-credits 'hang'?

Tue Jan 13 11:07:44 PST 2015

On 01/12/2015 12:06 AM, Michal Kazior wrote:
> On 9 January 2015 at 17:55, Ben Greear <greearb at candelatech.com> wrote:
> [...]
>> One thing I noticed yesterday is that when the driver tries to put a
>> vdev down, the firmware will try to flush, and will delay vdev-down
>> event until fw is flushed.  I changed CT firmware to automatically
>> flush in this case, but perhaps the driver should explicitly ask
>> firmware to flush the vdev before putting it down?
> 
> I recall the discussion we once had. I do plan on doing a patch for
> that, eventually.
> 
> 
>> Once the driver gets out of sync due to timeouts, the firmware
>> is likely to assert soon after if wmi hang doesn't happen because
>> firmware will think vdev is up when it is not, or vice versa.
>>
>> Also, I notice a pattern in the failure case.
>>
>> The sequence is almost always something like this:
>>
>> [lots of vdev up/down, re-associate, etc]
>>
>> vdev down (this would have timed out if I didn't put in the flush)
>>   * vdev down is usually last wmi cmd firmware receives.
>> driver tries to delete peer, that times out (firmware wmi layer never
>>   saw the command)
> 
> So there's a chance htc layer actually did get the buffer but for some
> reason it decided it isn't a wmi buffer. One reason could be the
> buffer contained garbage (e.g. due to missing barrier on host so
> firmware could read some data from an old physical address that was
> stored in ce descriptor item).

I managed to get some better debug out of the firmware.

I am having a hell of a time figuring out how the code flows through all
of the callbacks (in both firmware and driver), but it appears this is what happened:

(I have instrumented transfer-id in both firmware and driver)

Firmware sent a wmi message with transfer-id 72.
Kernel received this transfer-id.
Firmware's last send-callback transfer-id is 71.

So, it seems that either ath10k did not do the transfer-complete logic,
did it incorrectly, or the firmware did not notice it was done.
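
To make the mismatch concrete, here is a minimal sketch in C of the kind of bookkeeping the instrumentation implies. The struct and function names are hypothetical; they are not taken from ath10k or from the firmware, and the real code paths are spread across several callbacks.

```c
#include <stdint.h>

/* Hypothetical tracker, not an actual ath10k or firmware structure. */
struct xfer_tracker {
	uint32_t last_sent_id;      /* transfer-id of the most recent send */
	uint32_t last_completed_id; /* transfer-id seen by the last send-callback */
};

/* Record that a message with this transfer-id was handed to the transport. */
static void xfer_sent(struct xfer_tracker *t, uint32_t id)
{
	t->last_sent_id = id;
}

/* Record that the send-callback ran for this transfer-id. */
static void xfer_completed(struct xfer_tracker *t, uint32_t id)
{
	t->last_completed_id = id;
}

/*
 * Number of sends still waiting for a send-callback.
 * Unsigned subtraction keeps the result correct across id wrap-around,
 * assuming ids increase by one per send.
 */
static uint32_t xfer_outstanding(const struct xfer_tracker *t)
{
	return t->last_sent_id - t->last_completed_id;
}
```

With the values above (sent 72, last send-callback 71), xfer_outstanding() would return 1 and stay there, which is the stuck state: the send went out, but its completion was never seen on the firmware side.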

I cannot find the transfer-complete code that should be notifying the
firmware.  If you know where it is, can you point me to it?

Thanks,
Ben

> 
> 
>> firmware reports one or two more messages to driver, and if it manages to report
>> a dbglog, that shows a tx-timeout message usually within a second of
>> the vdev down.  This happens whether or not I flush the vdev bringing it
>> down.
>>
>> At this point, one more request from driver may be sent, after that,
>> it is credit starvation.  Firmware continues to run (timers fire, etc).
>>
>> I think that firmware is also waiting on a completion event from the
>> CE layer...I plan to dig into that more today.
> 
> Hm.. This reminds me of issues hw1.0 had. I'd check if one of the
> workarounds ath10k had changes anything (see
> ath10k_ce_src_ring_write_index_set in ce.c in 5e3dd157ce).
> 
> 
> Michał
> 

-- 
Ben Greear <greearb at candelatech.com>
Candela Technologies Inc  http://www.candelatech.com