[PATCH v2 ath-next 2/2] wifi: ath11k: fix HTC rx insufficient length

Miaoqing Pan quic_miaoqing at quicinc.com
Wed Mar 12 18:41:51 PDT 2025



On 3/13/2025 12:43 AM, Johan Hovold wrote:
> On Wed, Mar 12, 2025 at 09:11:45AM +0800, Miaoqing Pan wrote:
>> On 3/11/2025 11:20 PM, Jeff Johnson wrote:
>>> On 3/11/2025 1:29 AM, Miaoqing Pan wrote:
>>>> On 3/10/2025 6:09 PM, Johan Hovold wrote:
>>>>> I'm still waiting for feedback from one user that can reproduce the
>>>>> ring-buffer corruption very easily, but another user mentioned seeing
>>>>> multiple zero-length descriptor warnings over the weekend when running
>>>>> with this patch:
>>>>>
>>>>> 	ath11k_pci 0006:01:00.0: rxed invalid length (nbytes 0, max 2048)
>>>>>
>>>>> Are there ever any valid reasons for seeing a zero-length descriptor
>>>>> (i.e. unrelated to the race at hand)? IIUC the warning would only be
>>>>> printed when processing such descriptors a second time (i.e. when
>>>>> is_desc_len0 is set).
>>>>
>>>> That's exactly the logic, only can see the warning in a second time. For
>>>> the first time, ath12k_ce_completed_recv_next() returns '-EIO'.
>>>
>>> That didn't answer Johan's first question:
>>> Are there ever any valid reasons for seeing a zero-length descriptor?
>>
>> The events currently observed are all firmware logs. The discarded
>> packets will not affect normal operation. I will adjust the logging to
>> debug level.
> 
> That still does not answer the question whether there are ever any valid
> reasons for seeing zero-length descriptors. I assume there are none?
> 
>>> We have an issue that there is a race condition where hardware updates the
>>> pointer before it has filled all the data. The current solution is just to
>>> read the data a second time. But if that second read also occurs before
>>> hardware has updated the data, then the issue isn't fixed.
>>   
>> Thanks for the addition.
>>
>>> So should there be some forced delay before we read a second time?
>>> Or should we attempt to read more times?
>>
>> The initial fix was to keep waiting for the data to be ready. The
>> observed phenomenon is that when the second read fails, subsequent reads
>> will continue to fail until the firmware's CE2 ring is full and triggers
>> an assert after timeout. However, this situation is relatively rare, and
>> in most cases, the second read will succeed. Therefore, adding a delay
>> or multiple read attempts is not useful.
> 
> The proposed fix is broken since ath11k_hal_ce_dst_status_get_length()
> not just reads the length but also sets it to zero so that the updated
> length may indeed never be seen.
> 
Good point!

> I've taken a closer look at the driver and it seems like we're missing a
> read barrier to make sure that the updated descriptor is not read until
> after the head pointer.
> 
> Miaoqing, could you try the below patch with your reproducer and see if
> it is enough to fix the corruption?
> 
Sure, the stress test is running.

> If so I can resend with the warning removed and include a corresponding
> fix for ath12k (it looks like there are further places where barriers
> are missing too).
> 
> Johan
> 
> 
If the DMA read barrier works, do you think my submitted patch series is 
still needed? Because the error handling is incorrect.


>  From 656dbd0894741445aeb16ee8357e6fef51b6084c Mon Sep 17 00:00:00 2001
> From: Johan Hovold <johan+linaro at kernel.org>
> Date: Wed, 12 Mar 2025 16:49:20 +0100
> Subject: [PATCH] wifi: ath11k: fix ring-buffer corruption
> 
> Users of the Lenovo ThinkPad X13s have reported that Wi-Fi sometimes
> breaks and the log fills up with errors like:
> 
> 	ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1484, expected 1492
> 	ath11k_pci 0006:01:00.0: HTC Rx: insufficient length, got 1460, expected 1484
> 
> which based on a quick look at the driver seemed to indicate some kind
> of ring-buffer corruption.
> 
> Miaoqing Pan tracked it down to the host seeing the updated destination
> ring head pointer before the updated descriptor, and the error handling
> for that in turn leaves the ring buffer in an inconsistent state.
> 
> Add the missing the read barrier to make sure that the descriptor is
> read after the head pointer to address the root cause of the corruption.
> 
> The error handling can be fixed separately in case there can ever be
> actual zero-length descriptors.
> 
> FIXME: remove WARN_ON_ONCE() added for verification purposes
> 
> Tested-on: WCN6855 hw2.1 WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.41
> 
> Fixes: d5c65159f289 ("ath11k: driver for Qualcomm IEEE 802.11ax devices")
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218623
> Link: https://lore.kernel.org/20250310010217.3845141-3-quic_miaoqing@quicinc.com
> Cc: Miaoqing Pan <quic_miaoqing at quicinc.com>
> Cc: stable at vger.kernel.org	# 5.6
> Signed-off-by: Johan Hovold <johan+linaro at kernel.org>
> ---
>   drivers/net/wireless/ath/ath11k/ce.c | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/net/wireless/ath/ath11k/ce.c b/drivers/net/wireless/ath/ath11k/ce.c
> index e66e86bdec20..423b970e288c 100644
> --- a/drivers/net/wireless/ath/ath11k/ce.c
> +++ b/drivers/net/wireless/ath/ath11k/ce.c
> @@ -393,8 +393,12 @@ static int ath11k_ce_completed_recv_next(struct ath11k_ce_pipe *pipe,
>   		goto err;
>   	}
>   
> +	/* Make sure descriptor is read after the head pointer. */
> +	dma_rmb();
> +
>   	*nbytes = ath11k_hal_ce_dst_status_get_length(desc);
>   	if (*nbytes == 0) {
> +		WARN_ON_ONCE(1);	// FIXME: remove
>   		ret = -EIO;
>   		goto err;
>   	}




More information about the ath11k mailing list