Firmware crash when sending large numbers of forwarded packets

Mon Mar 10 06:16:53 EDT 2014

On 10 March 2014 11:10, Michal Kazior <michal.kazior at tieto.com> wrote:
> On 8 March 2014 09:20, Avery Pennarun <apenwarr at gmail.com> wrote:
>> On Sat, Mar 8, 2014 at 2:03 AM, Kalle Valo <kvalo at qca.qualcomm.com> wrote:
>>> Avery Pennarun <apenwarr at gmail.com> writes:
>>>> I'm having a problem where if I transmit too fast out the ath10k
>>>> interface in AP mode, I get a near-immediate firmware crash.
>>>
>>> [...]
>>>
>>>> Versions:
>>>> - kernel is based on current kvalo/for-linville branch (should I try
>>>> something else?) but seems to be the same in linux-next-20140114 so I
>>>> don't think this behaviour has changed lately.
>>>
>>> I do not recommend using for-linville branch for anything. As the name
>>> implies, it's only for John Linville to pull ath10k and ath6kl changes
>>> to his tree.
>>>
>>> What I recommend is to use the master branch of my ath.git tree. That's
>>> fairly recent wireless-testing (max 2 weeks old) plus latest ath10k +
>>> ath6kl patches I have (ie. merge of wireless-testing and my ath-next
>>> branch).
>>
>> Ok, thanks.  We're using a fairly old kernel on our device right now
>> (3.2.26) so we're using the ath10k driver from linux-backports.  This
>> means it's a little tricky to pick an arbitrary version if it has
>> diverged to far from linux/master or linux-next.  I did try a few
>> different versions though and they did the same thing.
>>
>>>> - firmware version 10.1.467.2-1, but also tested with 10.1.467.1-1
>>>> with no difference.
>>>>
>>>> I assume other people are not experiencing this or they would have
>>>> mentioned it by now.  What can I do to help debug this?
>>>
>>> We have reported the issue to the firmware team and got some feedback
>>> already. Hopefully we know more early next week.
>>
>> Thanks!
>>
>> Another update.  On a whim, based on the earlier mention that problems
>> might be related to extra burstiness of forwarding vs. local traffic
>> generation, I decided to add a udelay() before transmitting each
>> packet.  I started with udelay(1000) and the problem went away
>> (although of course performance was terrible).  I slowly reduced the
>> delay until I reached ndelay(1), and the problem stayed gone.  So I
>> tried a mb() instead:
>>
>> diff --git a/drivers/net/wireless/ath/ath10k/ce.c
>> b/drivers/net/wireless/ath/ath10k/ce.c
>> index a79499c..a808d82 100644
>> --- a/drivers/net/wireless/ath/ath10k/ce.c
>> +++ b/drivers/net/wireless/ath/ath10k/ce.c
>> @@ -291,6 +291,7 @@ int ath10k_ce_send_nolock(struct ath10k_ce_pipe *ce_state,
>>   if (ret)
>>   return ret;
>>
>> + mb();
>>   if (unlikely(CE_RING_DELTA(nentries_mask,
>>     write_index, sw_index - 1) <= 0)) {
>>   ret = -ENOSR;
>> --
>> 1.9.0.279.gdc9e3eb
>>
>>
>> Somehow this eliminates my firmware crashes.  It's extremely reliable;
>> add this line and my crashes go away.  Remove this line and my UDP
>> iperf can crash the firmware in a couple of seconds.
>>
>> For this particular test I was using a backports built from linux
>> v3.11.8 merged with your ath10k-stable-3.11-8 tag.
>>
>> Any idea why this would make any difference?
>
> The FW dump is supposedly related to it seeing a duplicate msdu_id tx request.
>
> ath10k fills in a tx descriptor. The descriptor contains an id which
> is used for completion handling (FW signals which id completed).
> ath10k uses a spinlock protected bitmap to manage this metadata.
> Descriptors are alloced via dma pool (consistent dma memory).
>
> It is highly unlikely for ath10k to pick duplicate msdu_id in the
> first place - you'd have to assume spinlock fail which would suggest
> your system would be pretty fun. This leaves either low level chunk
> submission is at play or DMA goes crazy.
>
> The descriptor is transfered in two chunks over CE ring. The
> ce_send_nolock is used to submit each separately (via pci_tx_sg). The
> first contains msdu id, the other one is the msdu partial as frame
> prefetch for FW classification engine. Once the second chunk is
> submitted CE ringbuffer index is written to iomap.
>
> If I assume this is DMA coherency issue, then msdu_id the device sees
> is the old one (that has been overwritten but hasn't been flushed from
> CPU caches yet). Then this is a platform bug, not ath10k one.
>
> If I assume this is chunk submission ordering issue (CE ring item is
> updated _after_ ring index in iomap is updated) then the device uses
> an old tx descriptor pointer and an old (or re-used and currently used
> msdu_id -- remember all descriptors come from dma pool which I assume
> re-uses memory chunks). Then this is ath10k bug.
>
> The latter is a little more plausible because mb() fixes. udelay()
> might implicitly do the same thing.

Now that I think you can probably try placing the mb() after `*desc =
*desc`, or better, right before ath10k_ce_src_ring_write_index_set()
(inside the conditional). This is will make sure all prior CE items
are ready and set before CE ring index is updated.

Michał