Reproducible issue in hacked 3.17 kernel, CT firmware

Wed Jan 7 10:13:54 PST 2015

On 01/07/2015 05:38 AM, Ben Greear wrote:
> 
> 
> On 01/07/2015 01:58 AM, Michal Kazior wrote:
>> On 30 December 2014 at 20:18, Ben Greear <greearb at candelatech.com> wrote:
>>> yeah, so maybe not reproducible upstream, but anyway...
>>>
>>> My test case is to re-associate 4 stations over and over again, with
>>> a scan and a 5 second sleep between iterations.  After
>>> a short time, something goes weird and OS is mostly hung, probably
>>> because important locks are held while ath10k is timing out communication
>>> to firmware.
>>>
>>> The last message I see from firmware is that it is deleting vdev 4.
>>>
>>> I do not see any indication that firmware is crashed, but something
>>> is wrong, maybe mgt buffers are used up?
>> [...]
>>> [  342.962494] ath10k_pci 0000:04:00.0: failed to set erp slot for vdev 4: -11
>>
>> -11 = -EAGAIN = out of wmi-htc tx credits. I wonder what the dbg
>> buffer is trying to say.
>>
>> Either host sent a corrupted message and clogged up firmware buffers,
>> firmware is busy processing other commands (wmi mgmt tx, wmi bcn
>> non-dma tx) or became confused/corrupted.
> 
> I finally got back to debugging this yesterday, and interestingly, when
> I added dbglog calls in the firmware around the credit handling, the problem is 'fixed'.
> 
> Looks like it ran overnight, whereas before it would fail within a few minutes.
> 
> So, maybe a race around pci memory flushing or something like that?
> 
> I'll slowly back out my debug today and see what I can see.

It finally locked up this morning... I see the last credit consumed at 8:37:02, and
then I finally get two credits back from the firmware at 9:12:42.

I guess more instrumentation is required :P
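
For reference, a quick sketch of how the stall could be measured from the two
timestamps above. This is only a host-side helper for reading the logs, not
ath10k or firmware code. The function names and the 3 second threshold are
made up for illustration, and it assumes both stamps are from the same day.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"  # wall-clock stamps as quoted in this mail


def credit_stall(last_consumed: str, next_returned: str) -> timedelta:
    """Gap between the last credit consumed and the next credit returned.

    Assumes both stamps fall on the same day (no midnight rollover).
    """
    return datetime.strptime(next_returned, FMT) - datetime.strptime(last_consumed, FMT)


def is_stalled(gap: timedelta, threshold_s: float = 3.0) -> bool:
    # threshold_s is an assumed value for illustration only,
    # not the driver's actual WMI timeout.
    return gap.total_seconds() > threshold_s


gap = credit_stall("8:37:02", "9:12:42")
print(gap, gap.total_seconds(), is_stalled(gap))  # 0:35:40 2140.0 True
```

So the firmware sat on the credits for a bit over 35 minutes before handing
two back.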

Thanks,
Ben

-- 
Ben Greear <greearb at candelatech.com>
Candela Technologies Inc  http://www.candelatech.com