ath10k + INTEL_IDLE aka. cstates == firmware crash
Fabian Wittenberg
Fabian.Wittenberg at sophos.com
Mon Feb 23 05:44:53 PST 2015
Hi Michal,
I used firmware versions 10.1 and 10.2 from here:
https://github.com/kvalo/ath10k-firmware. Both show the same behavior.
You are right: some BIOSes handle C-states strangely, but we have no
influence on the BIOS because it is provided by our hardware vendor. We
experimented a lot with the MSI masking bit of the PCIe root bridge to
which the ac card is connected, but toggling this bit brought no
noticeable improvement.
We have also tested the same boards with cards that use ath9k. They
work just fine, both with and without INTEL_IDLE enabled.
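
For anyone comparing boards, a minimal sketch for checking which cpuidle
driver is active and which idle states cpu0 exposes. It assumes the usual
Linux sysfs layout under /sys/devices/system/cpu; the function name
cpuidle_summary is made up for this example, and the paths may be missing
on kernels built without cpuidle support.

```python
from pathlib import Path


def cpuidle_summary(sysfs_cpu="/sys/devices/system/cpu"):
    """Return (driver, [state names]) read from a cpuidle sysfs tree.

    Assumption: the standard layout, i.e. cpuidle/current_driver and
    cpu0/cpuidle/stateN/name. Missing files yield None or an empty list.
    """
    root = Path(sysfs_cpu)

    driver_file = root / "cpuidle" / "current_driver"
    driver = driver_file.read_text().strip() if driver_file.is_file() else None

    states = []
    state_root = root / "cpu0" / "cpuidle"
    if state_root.is_dir():
        # Keep only stateN directories and sort them numerically (state10 after state2).
        state_dirs = [p for p in state_root.iterdir()
                      if p.name.startswith("state") and p.name[5:].isdigit()]
        for state_dir in sorted(state_dirs, key=lambda p: int(p.name[5:])):
            name_file = state_dir / "name"
            if name_file.is_file():
                states.append(name_file.read_text().strip())
    return driver, states


if __name__ == "__main__":
    driver, states = cpuidle_summary()
    print("cpuidle driver:", driver)
    print("cpu0 idle states:", ", ".join(states) or "none exposed")
```

If the driver reported is intel_idle, booting with intel_idle.max_cstate=0
on the kernel command line should disable it without rebuilding the kernel,
which may be a quicker way to compare both configurations.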
Regards,
Fabian
On 23.02.2015 at 14:32, Michal Kazior wrote:
> On 23 February 2015 at 14:08, Fabian Wittenberg
> <Fabian.Wittenberg at sophos.com> wrote:
>> Hi all,
>>
>> We are using mini-PCIe cards based on the brand new QCA988x chipset in our newest Wi-Fi-enabled firewall appliance, and we have had
>> a lot of problems getting them to run (Intel Rangeley platform; Intel(R) Atom(TM) CPU C2558 @ 2.40GHz).
> I recall one guy complained his Atom-based laptop wasn't happy running
> ath10k either but I think it was some electrical incompatibility and
> the machine didn't even POST when the card was plugged into mPCIe
> slot.
>
>
>> The card crashes after a few minutes with the ath10k driver (backports-3.19-rc1). Older versions are affected as well,
>> at least down to 3.12.20. I did intensive debugging and found that there
>> are major issues as soon as Intel's processor C-states are used. The
>> option is called "CONFIG_INTEL_IDLE" in the kernel config. This seems to be
>> a very serious issue, as it can even lead to low memory corruption and
>> kernel freezes. Low memory corruption doesn't always occur, only sometimes, which makes it hard to debug.
>> You also need a multiprocessor system to trigger the issue:
>> if you set the kernel parameter "maxcpus=1", the error doesn't occur even with CONFIG_INTEL_IDLE enabled.
> Through a quick search I've found this:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=715485
>
> It looks like some BIOSes can have buggy C-state handling. Maybe
> that's the root cause? From my experience QCA988x can sometimes be
> quirky when it comes to PCIe so I wouldn't be surprised if other
> devices don't crash.
>
>
>> The kernel output looks like this when the card stops working:
>>
>>
>> [ 3715.145865] ath10k: failed to install key for vdev 2 peer 00:1a:8c:0a:b5:01: -11
>> [ 3715.145876] wifi1: failed to remove key (1, ff:ff:ff:ff:ff:ff) from hardware (-11)
>> [ 3718.148226] ath10k: failed to install key for vdev 2 peer 00:1a:8c:0a:b5:01: -11
>> [ 3718.148236] wifi1: failed to set key (1, ff:ff:ff:ff:ff:ff) to hardware (-11)
>> [ 3723.152167] ath10k: failed to install key for vdev 0 peer 00:1a:8c:0a:34:01: -11
>> [ 3723.152178] wifi0: failed to remove key (1, ff:ff:ff:ff:ff:ff) from hardware (-11)
>> [ 3723.152185] ath10k: failed to transmit management frame via WMI: -11
>> [ 3726.154524] ath10k: failed to install key for vdev 0 peer 00:1a:8c:0a:34:01: -11
>> [ 3726.154535] wifi0: failed to set key (1, ff:ff:ff:ff:ff:ff) to hardware (-11)
>> [ 3729.156884] ath10k: failed to install key for vdev 0 peer 00:0e:8e:ae:5c:1c: -11
>> [ 3729.156890] ath10k: failed to transmit management frame via WMI: -11
>> [ 3729.156904] wifi0: failed to remove key (0, 00:0e:8e:ae:5c:1c) from hardware (-11)
>> [ 3732.159255] ath10k: failed to remove peer wep key 0: -11
>> [ 3732.159265] ath10k: failed to clear all peer wep keys for vdev 0: -11
>> [ 3732.159273] ath10k: failed to disassociate station: 00:0e:8e:ae:5c:1c vdev 0: -11
> [...]
>
> It seems the firmware stopped replenishing WMI-HTC Tx credits. This is
> most likely not the mgmt-related tx credit starvation; instead,
> communication with the device is really broken.
>
>
>> Sometimes, but not always, the message "firmware crashed!" appears in dmesg, but it doesn't matter which error message is actually shown:
>> the behavior is always the same. The card stops working until reboot. Unloading and reloading ath10k_pci, ath10k_core and ath doesn't help in this case.
>> The underlying problem behind all the error messages I have seen so far is a broken link between the card's firmware and the ath10k driver.
>> Depending on when this "connection loss" happens, the error messages differ slightly,
>> because they depend on the state the driver is in while it is trying to talk to the card's firmware via WMI.
>>
>> If you try to reproduce this, you have to wait between 3 and 60 minutes to see the crash. You can increase the likelihood of a crash by increasing
>> the amount of Wi-Fi traffic from foreign networks on the same channel.
>> I tested with four laptops connected to four QCA988x cards (AP mode). With this setup it takes around 3-10 minutes to reproduce.
>>
>> If you need more information I'm at your disposal.
> It'd be nice to know what firmware you're using. Generally I would
> discourage using 999.999.0.636 because it's very old.
>
>
> Michał
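
All failures in the quoted kernel log end in -11, which on Linux is
-EAGAIN. A rough sketch for tallying such lines per source and error code,
for example to spot when a captured dmesg flips over to nothing but -11
errors. The regular expression is only an assumption based on the line
format quoted above, and tally_errors is a name invented for this example.

```python
import re
from collections import Counter

# Matches e.g. "[ 3715.145865] ath10k: failed to ...: -11"
# and         "[ 3715.145876] wifi1: failed to ... (-11)".
# Assumption: the error code is the last token on the line.
LOG_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<source>\w+):.*?\(?(?P<code>-\d+)\)?\s*$"
)


def tally_errors(lines):
    """Count log lines per (source, error code); non-matching lines are skipped."""
    counts = Counter()
    for line in lines:
        # Drop mail quote markers ("> ", ">> ") in case the log is pasted from a reply.
        match = LOG_LINE.match(line.lstrip("> ").rstrip())
        if match:
            counts[(match.group("source"), int(match.group("code")))] += 1
    return counts


SAMPLE = [
    ">> [ 3715.145865] ath10k: failed to install key for vdev 2 peer 00:1a:8c:0a:b5:01: -11",
    ">> [ 3715.145876] wifi1: failed to remove key (1, ff:ff:ff:ff:ff:ff) from hardware (-11)",
    ">> [ 3718.148226] ath10k: failed to install key for vdev 2 peer 00:1a:8c:0a:b5:01: -11",
]

if __name__ == "__main__":
    for (source, code), count in sorted(tally_errors(SAMPLE).items()):
        print(source, code, count)
```

The same function can be fed the output of dmesg line by line; lines that
do not end in a negative error code are simply ignored.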