htt rx stopped. cannot recover

Pushpal Sidhu psidhu at gateworks.com
Tue Nov 18 17:23:26 PST 2014


On Mon, Nov 17, 2014 at 10:56 PM, Michal Kazior <michal.kazior at tieto.com> wrote:
> You can get super sure and take a look in dmesg/syslog to verify if
> ath10k's wlan interface is in promisc mode or not. I guess it's not
> though.
It seems that there's a "vdev 0". I'll look into hacking this out
later when I get some more time, though it seems that whenever a
station is associated with an AP, a vdev is always created.

>> ...
>> I tried not setting
>> the confused flag when the issue was hit, but eventually a kernel
>> panic would occur.
>
> I'm glad you like them.
>
> The result of not setting the confused flag in this case is pretty expected :-)
>
>> Something very interesting that I found was I only see this issue when
>> the radios are behind a PCIe bridge. That is, I tested on two sets
>> of boards with the same CPU/structure, with the one difference that
>> there is no PCIe bridge on one. I am currently running an extended
>> test on the board without a PCIe bridge to see if it isn't an issue
>> that will manifest itself at a later time, but as of now, it's looking
>> like that's not the case. Do you have any thoughts on this?
>
> Hmm.. Perhaps there's some kind of race and the host main memory isn't
> updated in a timely fashion? In that case ath10k could try to, instead
> of immediately unmapping, keep on syncing dma to cpu and checking the
> MSDU_DONE bit for a few msecs (what an ugly workaround..) before it
> finally unmaps the frame.
>
> I wonder if that's a bug in ath10k's target device PCIe controller or
> the PCIe bridge?

While running the extended test, at about 7 hours of continuous
testing with the two boards without a PCIe bridge (configured as a
wireless bridge, i.e. 4addr=on), the station (STA) hit a kernel panic
that seems to be caused by a double free or something similar. The log
is here (glad I set 'debug' on the kernel cmdline):
http://dpaste.com/1G77X7N (the most interesting info is between
timestamps [25684.512971] and [25684.587585]).

This is basically the same kernel oops I saw earlier when I disabled
the confused flag (it occurs in skb_release_data). I've ruled out an
SMP issue, as I can reproduce this with nosmp. Because I see this issue
on a board without a PCIe bridge, I'm leaning towards the root cause
being in the ath10k target device's PCIe controller (IRQs being fired
is what led to this code path).

When I get some time later, I'll try that workaround, though I don't
think that's the "right" way to fix this problem.
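
For reference, the workaround suggested above (keep syncing DMA to the CPU and re-checking MSDU_DONE for a short, bounded time before unmapping) could be sketched roughly as below. This is a userspace illustration only, not ath10k code: `struct rx_desc`, `RX_MSDU_DONE`, `rx_wait_msdu_done()` and the `fake_*` helpers are all hypothetical names. In a real driver the sync hook would presumably be dma_sync_single_for_cpu() and the delay hook a short udelay()/usleep_range(), depending on context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the "MSDU done" attention bit; the bit
 * position here is an assumption, not the real ath10k definition. */
#define RX_MSDU_DONE (1u << 31)

/* Hypothetical, heavily simplified RX descriptor living in DMA memory. */
struct rx_desc {
	volatile uint32_t attention_flags;
};

/* Hooks so the sketch can run outside the kernel (assumed interfaces). */
typedef void (*sync_fn)(struct rx_desc *desc, void *ctx);
typedef void (*delay_fn)(void *ctx);

/*
 * Re-sync the descriptor and poll the done bit at most max_polls times.
 * Returns true once the bit is seen (the frame may then be unmapped and
 * processed), false if it never shows up (the caller decides what to do).
 */
static bool rx_wait_msdu_done(struct rx_desc *desc, unsigned int max_polls,
			      sync_fn sync_for_cpu, delay_fn delay, void *ctx)
{
	assert(desc != NULL && sync_for_cpu != NULL);

	for (unsigned int i = 0; i < max_polls; i++) {
		sync_for_cpu(desc, ctx);	/* refresh the CPU's view */
		if (desc->attention_flags & RX_MSDU_DONE)
			return true;
		if (delay)
			delay(ctx);		/* brief pause before retrying */
	}
	return false;
}

/* Fake device for testing: the done bit lands only after N syncs. */
struct fake_dev {
	unsigned int syncs_until_done;
	unsigned int syncs;
};

static void fake_sync(struct rx_desc *desc, void *ctx)
{
	struct fake_dev *dev = ctx;

	if (++dev->syncs >= dev->syncs_until_done)
		desc->attention_flags |= RX_MSDU_DONE;
}
```

The poll count is bounded so that a descriptor that never completes cannot stall the RX path. What to do on a false return (drop the frame, mark the ring as confused, etc.) is left open here.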

- Pushpal


