[RFT] ath10k: restart fw on tx-credit timeout

Thu Feb 12 05:21:22 PST 2015

On 02/11/2015 10:55 PM, Michal Kazior wrote:
> On 11 February 2015 at 23:25, Ben Greear <greearb at candelatech.com> wrote:
>> On 02/10/2015 09:01 AM, Ben Greear wrote:
>>
>>> I've hacked CT firmware to do a flush of all vdevs itself when it detects WMI hang.
>>> I don't have a good test bed to reproduce the problem reliably, but I should know
>>> after a few days if the flush works at all.  If not, then it's a moot point anyway.
>>
>> So, this appears to at least partially work.
>>
>> But, what we notice is that when using multiple station vdevs, the system pretty much
>> becomes useless if we get any significant number of stuck or slow-to-transmit management
>> buffers over WMI.  Part of this is because WMI messages are sent when holding rtnl
>> much of the time, I think.
>
> Most, if not all, WMI commands are sent while holding conf_mutex. This
> lock is taken in many situations including when RTNL is held so your
> observation isn't entirely correct but isn't wrong either.
>
>
>> I would guess that an AP with lots of peers associated might have similar problems
>> if peers are not ACKing packets reliably.
>
> It's not the ACKing per se. It's whether stations are asleep and
> unresponsive or not. You could do funny DoS attacks with a single
> ath9k card (using virtual stations) on ath10k APs now I guess :-)

In our lab we have some setups where there should be no power-save at all,
but we still see this issue.  Unlucky (or nefarious) brokenness in the peer can seem to
mostly hang the local system due to the 'not entirely correct' assumption above :)

>> Probably the only useful way to fix this is to make the firmware and driver able to
>> send management frames over the normal transport like every other data packet?
>
> Agreed. HTT should've been used for entire traffic, including management frames.
>
> The workaround could've been to guarantee to have only 1 wmi-mgmt-tx
> in-flight but since tx-credits aren't replenished predictably you'll
> end up with the patch I originally did, i.e. sleep 2*bcn intval and
> wmi-peer-flush-tids after each unicast mgmt frame to a known station.

Even assuming I have the tx-credits replenishment fixed,
that work-around would make sending mgmt frames to many peers
very slow when at least a few peers are not answering quickly, right?
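
To put a rough number on it, here is a back-of-envelope sketch in Python.
It assumes a strictly serialized queue (one wmi-mgmt-tx in flight) where each
frame can cost up to 2 beacon intervals, and a 100 TU beacon interval. The
100 TU default, the function name, and the peer counts are my assumptions for
illustration, not numbers from the driver or from this thread.

```python
TU_US = 1024  # one 802.11 time unit (TU) is 1024 microseconds


def mgmt_tx_delay_ms(n_frames, bcn_intval_tu=100, waits_per_frame=2):
    """Upper bound on the time to drain n_frames from a one-in-flight queue.

    Hypothetical model: every frame pays waits_per_frame beacon intervals
    before the next frame may be sent.
    """
    per_frame_us = waits_per_frame * bcn_intval_tu * TU_US
    return n_frames * per_frame_us / 1000.0


if __name__ == "__main__":
    # One management frame per peer, sent back to back.
    for peers in (1, 10, 64):
        print(f"{peers:3d} peers: {mgmt_tx_delay_ms(peers):.1f} ms")
```

Under those assumptions, one frame each to 64 peers takes on the order of
13 seconds in the worst case, which is the slowness I'm worried about.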

>> Any idea why it wasn't written like that to begin with?
>
> Beats me.

This might be something I can fix in CT firmware, but I'm trying to kick a release out
the door, so I think I'll put this off for a bit.

Thanks,
Ben

-- 
Ben Greear <greearb at candelatech.com>
Candela Technologies Inc  http://www.candelatech.com