Firmware crash when sending large numbers of forwarded packets

Sun May 11 18:43:56 PDT 2014

On Mon, Mar 10, 2014 at 6:16 AM, Michal Kazior <michal.kazior at tieto.com> wrote:
> On 10 March 2014 11:10, Michal Kazior <michal.kazior at tieto.com> wrote:
>> It is highly unlikely for ath10k to pick duplicate msdu_id in the
>> first place - you'd have to assume spinlock fail which would suggest
>> your system would be pretty fun. This leaves either low level chunk
>> submission is at play or DMA goes crazy.
>>
>> The descriptor is transfered in two chunks over CE ring. The
>> ce_send_nolock is used to submit each separately (via pci_tx_sg). The
>> first contains msdu id, the other one is the msdu partial as frame
>> prefetch for FW classification engine. Once the second chunk is
>> submitted CE ringbuffer index is written to iomap.
>>
>> If I assume this is DMA coherency issue, then msdu_id the device sees
>> is the old one (that has been overwritten but hasn't been flushed from
>> CPU caches yet). Then this is a platform bug, not ath10k one.
>>
>> If I assume this is chunk submission ordering issue (CE ring item is
>> updated _after_ ring index in iomap is updated) then the device uses
>> an old tx descriptor pointer and an old (or re-used and currently used
>> msdu_id -- remember all descriptors come from dma pool which I assume
>> re-uses memory chunks). Then this is ath10k bug.
>>
>> The latter is a little more plausible because mb() fixes. udelay()
>> might implicitly do the same thing.
>
> Now that I think you can probably try placing the mb() after `*desc =
> *desc`, or better, right before ath10k_ce_src_ring_write_index_set()
> (inside the conditional). This is will make sure all prior CE items
> are ready and set before CE ring index is updated.

Sorry to take a long time to get back on this thread, but I finally
have some better test results.

First, our stability has definitely been improved a lot since my
original mb() patch.

Secondly, we still had occasional firmware crashes under certain load
types.  Thanks to Ben Greear and some helpful people at QCA, we
diagnosed this as the same problem as the original mb() was trying to
fix.  After moving the mb() as Michal suggested, this class of
firmware crash (which was much more rare after my original patch)
seems to be gone too.  So the new location of the mb() seems to indeed
be better.

We still have rare firmware crashes for I think a different reason(s),
but I have/will ask about those in other threads.

Thanks for the help!

Avery