[PATCH ath-current 0/7] wifi: ath12k: Fix Issues in REO RX Queue Updates
Nithyanantham Paramasivam
nithyanantham.paramasivam at oss.qualcomm.com
Wed Sep 10 22:58:15 PDT 2025
Hi Nicolas,
Sorry for the delayed reply.
I agree with your suggestion: it’s a good direction, but it will
require extensive testing since it impacts all REO commands, not just
the RX queue update.
Given that the retry mechanism is a critical fix, I’d prefer to
proceed with that series first. I’ll follow up with an incremental
version that implements handling of REO commands using a workqueue.
Thanks again for your input.
Best regards,
Nithy
On Thu, Aug 7, 2025 at 7:09 PM Nicolas Escande <nico.escande at gmail.com> wrote:
>
> On Thu Aug 7, 2025 at 2:31 PM CEST, Nithyanantham Paramasivam wrote:
> > On Thu, Aug 7, 2025 at 1:32 AM Nicolas Escande <nico.escande at gmail.com> wrote:
> >>
> >> On Wed Aug 6, 2025 at 1:17 PM CEST, Nithyanantham Paramasivam wrote:
> >> > During stress test scenarios, when the REO command ring becomes full,
> >> > the RX queue update command issued during peer deletion fails due to
> >> > insufficient space. In response, the host performs a dma_unmap and
> >> > frees the associated memory. However, the hardware still retains a
> >> > reference to the same memory address. If the kernel later reallocates
> >> > this address, unaware that the hardware is still using it, it can
> >> > lead to memory corruption, since the host might access or modify
> >> > memory that is still actively referenced by the hardware.
> >> >
> >> > Implement a retry mechanism for the HAL_REO_CMD_UPDATE_RX_QUEUE
> >> > command during TID deletion to prevent memory corruption. Introduce
> >> > a new list, reo_cmd_update_rx_queue_list, in the dp structure to
> >> > track pending RX queue updates. Protect this list with
> >> > reo_rxq_flush_lock, which also ensures synchronized access to
> >> > reo_cmd_cache_flush_list. Defer memory release until hardware
> >> > confirms the virtual address is no longer in use, avoiding immediate
> >> > deallocation on command failure. Release memory for pending RX queue
> >> > updates via ath12k_dp_rx_reo_cmd_list_cleanup() on system reset
> >> > if hardware confirmation is not received.
> >> >
> >> > Additionally, increase DP_REO_CMD_RING_SIZE to 256 to reduce the
> >> > likelihood of ring exhaustion. Use a 1KB cache flush command for
> >> > QoS TID descriptors to improve efficiency.
> >> >
> >>
> >> Hello,
> >>
> >> I'm not sure I fully understand, so please correct me where I'm wrong, but from
> >> what I got looking at the code:
> >> - you increase the ring size for reo cmd
> >> - you enable releasing multiple tid buffers at once
> >> - as soon as you allocate the tid you create an entry in the list flagging it
> >> as active
> >> - when you need to clean up a tid
> >> - you mark it as inactive in the list, then
> >> - try to process the whole list by
> >> - sending the reo command to release it
> >> - if it succeeds you release it and remove the entry from the list
> >> - on core exit, you re-process the list
> >>
> >> What is bothering me with this is that when a vdev which has multiple STAs goes
> >> down, you will have a lot of these entries to push to the firmware at once:
> >>
> >> - So increasing the ring size / being able to release multiple entries at once
> >> in one reo cmd may or may not be enough to handle the burst. It seems
> >> that increasing those is just a band-aid on the underlying problem, but I
> >> understand it's just to be on the safe side.
> >> - If entries do not get sent/accepted by the firmware, we will need to wait
> >> for another station removal before retrying, which could be a while.
> >> - Or in the worst case (one vdev going down and needing to release the TIDs of
> >> all its stations), the more entries we have in the list, the less likely we
> >> will be able to push the delete of all stations plus the ones still in
> >> the queue.
> >>
> >> So, on station removal, why not just make things completely async? Push the tid
> >> deletes into a list and wake a workqueue that tries to push them to the firmware.
> >> If the ring is full, retry periodically.
> >>
> >> Thanks
> >
> > Hi Nicolas,
> Hi
> >
> > Thanks for the detailed observations and suggestions.
> >
> > You're right in your understanding of the flow. To clarify further:
> >
> > When the host fails to obtain a buffer from the hardware to send a
> > command—due to the REO command ring being full
> > (ath12k_hal_reo_cmd_send returning -ENOBUFS)—we treat it as a command
> > send failure and avoid deleting the corresponding entry immediately.
> >
> > This situation typically arises in the flow:
> >
> > ath12k_dp_rx_process_reo_cmd_update_rx_queue_list →
> > ath12k_dp_rx_tid_delete_handler → returns -ENOBUFS
> >
> > Given the command ring size is 256, and each peer generally has 16
> > TIDs, each TID sends 2 commands (RXQ update and cache flush). So,
> > deleting one peer involves up to 32 commands. This means we can delete
> > up to 8 peers (8 × 32 = 256 commands) before hitting the ring limit.
> >
> > From the 9th peer onward, we may encounter ring exhaustion. However,
> > we handle retries in the REO command callback
> > (ath12k_dp_rx_tid_del_func). If commands fail to send, they remain in
> > the pending list and can be retried via the success callback of
> > earlier commands.
> That was the part I did not get: I thought it was just another peer removal that
> would cause the whole list of pending entries to be reprocessed and pushed to
> the ring.
> >
> > Do we foresee any issues with this?
> >
> Nope, it should work.
> > Regarding your suggestion to make the deletion process fully
> > asynchronous via a workqueue, that’s a valid point. Workqueue-based
> > handling is a good idea, but we need to account for potential delays
> > if the worker thread isn’t scheduled promptly. We also need to
> > consider timeout-based rescheduling, especially during command
> > failures, to ensure retries happen in a timely manner. We’ll evaluate
> > this suggestion further and get back to you.
> >
> IMHO it feels like it would simplify the code to just push + wake a wq on
> delete, and reschedule if no space is available in the ring. The timing should
> not be such a big deal, right? The important part is to eventually push all reo
> commands to the firmware, but it should not have an impact on the rest of the
> operation if it reaches the firmware later rather than sooner, right?
>
> > Thanks again for the insightful feedback!
> Thanks for all the explanations
> >
> > Best Regards,
> > Nithy
>