Hard lockup during vif restart tests.

Wed Sep 17 23:23:17 PDT 2014

On 17 September 2014 17:52, Ben Greear <greearb at candelatech.com> wrote:
> On 09/16/2014 11:34 PM, Michal Kazior wrote:
>> On 16 September 2014 20:42, Ben Greear <greearb at candelatech.com> wrote:
>>> This is on a 3.14.14+ hacked kernel, with CT firmware.
>>>
>>> Test case is to restart stations (and the AP
>>> on the other side) every 10-30 seconds.
>>> After a bit, the station machine locked up hard.
>>>
>>> I have no idea how to trouble-shoot this better, so this is
>>> just FYI.
>>>
>> [...]
>>> ath10k: boot warm reset complete
>>> ath10k: failed to power up target using warm reset: -110
>>> ath10k: trying cold reset
>>> ath10k: boot cold reset
>>> ath10k: boot cold reset complete
>>> [hang, even sysrq will not work]
>>
>> There's a known problem with cold reset being capable of locking up
>> entire system (depends on the pci-e controller, e.g. AP135 splats a
>> Data Bus Error instead).
>>
>> Actually warm reset can do the same in some corner cases: try running
>> Rx traffic and just start the recovery sequence (without actually
>> crashing the fw). My x86 locks up very easily with this.
>>
>> I strongly suggest you use reset_mode=1 when you load ath10k_pci so
>> cold reset isn't used. This may result in ath10k being unable to bring
>> up the device in some rare cases (e.g. after an IOMMU fault if your
>> system supports it) but I believe it's far better than having the
>> whole system lock up.
>>
>> My suspicion is tx/rx rings, dma transfer engines, internal irqs
>> aren't stopped properly. I have a prototype patch for the warm reset
>> problem but it's incomplete and I'm not sure if I can share it yet.
>
> I will try the warm-reset-only flag, and I do hope you have success
> with the warm/cold reset fixes.

The prototype sort of works as it is now, but it's ugly.

> But, I still wonder if we could just reset less often and maybe
> make it a bit harder to hit these problems?
>
> Why do we reset the firmware/NIC when we admin down/up the
> vif (when a single vif is active)?  Couldn't we just keep
> the firmware active in this state and not risk lockup due
> to reset?

When you bring down the last interface, mac80211 calls drv_stop().
There isn't any real need to keep the device up and running after that,
other than to work around the reset issue. But then you need to deal
with firmware quirks. I recall it could report Rx indications after all
vdevs had been removed (and this is now also observable with the 10.2
firmware during probing/bootup). It's just simpler to reboot the
firmware on drv_stop()/drv_start().
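
For reference, the warm-reset-only setting (reset_mode=1) suggested
earlier in the thread could be applied roughly like this. This is only
a sketch: it assumes your ath10k_pci build exposes the reset_mode
parameter, and the modprobe.d file name is just an example.

    # one-off: reload the module with cold reset disabled
    modprobe -r ath10k_pci
    modprobe ath10k_pci reset_mode=1

    # persistent: e.g. in /etc/modprobe.d/ath10k.conf (name is arbitrary)
    options ath10k_pci reset_mode=1

As noted above, the trade-off is that the device may fail to come up
in some rare cases, which still seems better than a whole-system lockup.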

Michał