[REGRESSION] ath10k: failed to flush transmit queue
James Prestwood
prestwoj at gmail.com
Thu Feb 20 05:55:54 PST 2025
Hi All,
On 7/31/24 11:13 AM, Kalle Valo wrote:
> Felix Fietkau <nbd at nbd.name> writes:
>
>> On 12.07.24 04:23, Cedric Veilleux wrote:
>>
>>> AP mode.
>>> Both 2.4 and 5ghz channels.
>>> Using WLE600VX (QCA986x/988x), we are seeing the following errors in
>>> kernel logs:
>>> [12978.022077] ath10k_pci 0000:04:00.0: failed to flush transmit queue
>>> (skip 0 ar-state 1): 0
>>> [13343.069189] ath10k_pci 0000:04:00.0: failed to flush transmit queue
>>> (skip 0 ar-state 1): 0
>>> They are somewhat random but frequent. Can happen once a day or many
>>> times per hour.
>>> They are associated with 3-4 seconds of radio silence. Full packet
>>> loss. Then everything resumes normally, STA are still associated and
>>> traffic resumes.
>>> I have tested with major kernel versions:
>>> 6.1.97: stable (tested for many days on 10+ access points)
>>> 6.2.16: stable (tested for few hours single machine)
>>> 6.3.13: stable (tested for few hours single machine)
>>> 6.4.16: unstable (we have errors within an hour)
>>> 6.5.13: unstable (we have errors within an hour)
>>> 6.6.39: unstable (we have errors within an hour)
>>> 6.7.12: unstable (we have errors within an hour)
>>> 6.8.10: unstable (we have errors within an hour)
>>> 6.9.7: unstable (we have errors within an hour)
>>> From these tests I believe something changed in 6.4 series causing
>>> instabilities and the dreaded "failed to flush transmit queue" error.
>>> This is a custom linux distribution. Only change is the kernel. All
>>> other packages are same versions. Everything rebuilt from source using
>>> bitbake/yocto. Same linux-firmware files.
>> I'm pretty sure it's caused by this commit:
>>
>> commit 0b75a1b1e42e07ae84e3a11d2368b418546e2bec
>> Author: Johannes Berg <johannes.berg at intel.com>
>> Date: Fri Mar 31 16:59:16 2023 +0200
>>
>> wifi: mac80211: flush queues on STA removal
>>
>> I guess somebody needs to look into making the queue flush on ath10k
>> more reliable (or even better, implement a more lightweight .flush_sta
>> op).
>>
>> I don't have time to do the work myself, but hopefully this
>> information could help somebody else take care of it.
> Adding ath10k list so that everyone see this.
I want to revive this thread and provide some additional data. This is
not just something that happens in AP mode, or specifically with the
hardware mentioned. After upgrading from 6.2 to 6.8 we started seeing
this on client devices running the QCA6174 hw 3.2 firmware ver
WLAN.RM.4.4.1-00288- api 6. We see it during disconnects, which isn't
that big of a deal; the more concerning case is during roams, where it
makes roams go from less than 200ms to over 5 seconds.
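For anyone wanting to check how often this hits on their own systems,
here is a minimal sketch that counts the failures in a kernel log and
reports the time between them. It assumes the message format quoted
above; the function names and the filler "wlan0" log line are made up
for illustration.

```python
import re

# Matches the start of the ath10k flush-failure message quoted in this thread.
FLUSH_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+ath10k_pci\s+(?P<dev>\S+):\s+"
    r"failed to flush transmit"
)


def flush_failures(lines):
    """Return (timestamp_seconds, pci_device) for each flush-failure line."""
    hits = []
    for line in lines:
        m = FLUSH_RE.search(line)
        if m:
            hits.append((float(m.group("ts")), m.group("dev")))
    return hits


def gaps(hits):
    """Seconds between consecutive failures, in log order."""
    times = [ts for ts, _ in hits]
    return [round(b - a, 6) for a, b in zip(times, times[1:])]


sample = [
    "[12978.022077] ath10k_pci 0000:04:00.0: failed to flush transmit queue"
    " (skip 0 ar-state 1): 0",
    # Hypothetical unrelated line, only here to show it is ignored.
    "[13000.000000] wlan0: associated",
    "[13343.069189] ath10k_pci 0000:04:00.0: failed to flush transmit queue"
    " (skip 0 ar-state 1): 0",
]

hits = flush_failures(sample)
print(len(hits), gaps(hits))
```

The same functions should work on real logs by passing the lines of
dmesg output instead of the sample list, as long as each message is on
a single line.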
Based on this report I have tried using Remi's set of patches [1], which
implement flush_sta(), but we end up with the same ~5 second hang, just
in ath10k_flush_sta() instead of ath10k_flush(). I'm unsure if this is a
firmware problem or some race within the driver itself. In the past I
have reduced timeouts [2] to work around these types of things, but it's
really just a band-aid.
I would agree that this was "introduced" by Johannes' commit above, but
the original commit does make sense... This is just an ath10k problem
with flushing the queues.
At this point I'm really left with two options:
- Revert Johannes' commit to flush the queues, thereby reducing
security, OR
- Reduce the timeout from 5 seconds to something more manageable, like
1 second (hopefully someone more in the know can comment here).
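To illustrate why the second option only bounds the stall rather than
fixing it, here is a rough user-space analogy (not ath10k code). It
assumes the flush is a wait on a "queue drained" condition with a fixed
timeout; the function name is made up, and the timeouts are scaled down
so it runs quickly.

```python
import threading
import time


def flush_with_timeout(queue_empty: threading.Event, timeout_s: float) -> float:
    """Wait for queue_empty or give up after timeout_s; return seconds blocked."""
    start = time.monotonic()
    drained = queue_empty.wait(timeout_s)  # False if the timeout expired
    elapsed = time.monotonic() - start
    if not drained:
        print(f"failed to flush (gave up after {timeout_s}s)")
    return elapsed


# Never set: models a transmit queue that never reports itself drained.
stuck = threading.Event()

long_stall = flush_with_timeout(stuck, 0.5)   # stands in for the 5 second timeout
short_stall = flush_with_timeout(stuck, 0.1)  # stands in for a 1 second timeout
```

In both cases the caller blocks for the whole timeout and the flush
still fails; a shorter timeout just makes the hang shorter.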
Has anyone else looked at this regression, or found a workaround
other than my options above?
Thanks,
James
[1] https://lore.kernel.org/linux-wireless/17d26d6a3e80ff03939ee7935fdc07f979b61a4f.1732293922.git.repk@triplefau.lt/
[2] https://lore.kernel.org/linux-wireless/20240814164507.996303-2-prestwoj@gmail.com/