Crash in hacked kernel with CT firmware.

Thu Jul 31 00:58:49 PDT 2014

On 30 July 2014 17:46, Ben Greear <greearb at candelatech.com> wrote:
> Not sure how relevant this is to upstream, but just in case someone
> wants to look at it:
>
> Kernel is modified 3.14.14+, with a good bit of backported ath10k and some
> patches of my own to help stabilize ath10k with my workload and to support
> CT firmware features.
>
> http://dmz2.candelatech.com/git/gitweb.cgi?p=linux-3.14.dev.y/.git;a=summary
>
> Firmware is CT firmware, and it has a bug in this test case where it crashes
> fairly often upon removal of a vdev after some traffic tests have been
> running.  Likely this firmware bug is something that I have added or
> at least exacerbated, and I am working on fixing it.
>
> But, when it crashes, it takes the kernel down shortly afterwards
> in a reliable manner:
[...]
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000068
> IP: [<ffffffffa06a318d>] ath10k_txrx_tx_unref+0x91/0x3c7 [ath10k_core]
[...]
> Call Trace:
>  [<ffffffffa06a28b4>] ath10k_htt_tx_detach+0x70/0xd1 [ath10k_core]
>  [<ffffffffa06a04cf>] ath10k_htt_detach+0x16/0x1b [ath10k_core]
>  [<ffffffffa069eab3>] ath10k_core_stop+0x4f/0x70 [ath10k_core]
>  [<ffffffffa069ae32>] ath10k_halt+0xde/0x161 [ath10k_core]
>  [<ffffffffa069aeed>] ath10k_stop+0x38/0x89 [ath10k_core]
>  [<ffffffffa05b0ae6>] ieee80211_stop_device+0x58/0x84 [mac80211]
>  [<ffffffffa069541c>] ? spin_lock_bh+0x9/0xb [ath10k_core]
>  [<ffffffffa059d0d3>] ieee80211_do_stop+0x625/0x67d [mac80211]
>  [<ffffffff810fdf6a>] ? trace_hardirqs_on+0xd/0xf
>  [<ffffffff810c6d42>] ? __local_bh_enable_ip+0xaf/0xd9
>  [<ffffffff815d8156>] ? _raw_spin_unlock_bh+0x31/0x35
>  [<ffffffff8153a693>] ? dev_deactivate_many+0x129/0x172
>  [<ffffffffa059d140>] ieee80211_stop+0x15/0x19 [mac80211]
[...]
> (gdb) l *(ath10k_txrx_tx_unref+0x91)
> 0xe18d is in ath10k_txrx_tx_unref (/mnt/sda/home/greearb/git/linux-3.14.dev.y/drivers/net/wireless/ath/ath10k/txrx.c:109).
> 104             }
> 105
> 106             msdu = htt->pending_tx[tx_done->msdu_id];
> 107             skb_cb = ATH10K_SKB_CB(msdu);
> 108
> 109             dma_unmap_single(dev, skb_cb->paddr, msdu->len, DMA_TO_DEVICE);

Okay.. So `msdu` is NULL. I can't find any unpaired used_msdu_ids
and pending_tx accesses. This suggests htt->pending_tx itself is
invalid (as well as used_msdu_ids), perhaps a use-after-free (neither
pointer is NULLed on detach). This in turn suggests
ath10k_htt_tx_detach() had already been called once and this is the
second call. The stack trace suggests the (allegedly second) call
originates from drv_stop().

When ath10k crashes, the ath10k_core_restart() worker calls
ath10k_halt() directly, sets the RESTARTING state and queues a mac80211
hw restart. ath10k_stop() calls ath10k_halt() only if the state is ON,
RESTARTED or WEDGED. RESTARTING isn't one of them, but since you have
more than one entry point for hw recovery (pci indication, wmi_send,
flush) you can trigger the restart worker again while the state is
still RESTARTING (i.e. a crash within a crash, before ath10k_start() is
called), which changes the state to WEDGED. WEDGED allows ath10k_halt()
to be called in ath10k_stop(). QED.
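
To make the sequence easier to follow, here is a rough userspace model
of the state transitions described above. It is only a sketch, not the
driver code: every model_* name is made up for illustration, and the
teardown is reduced to a counter so the double call can be observed
without actually freeing anything twice.

```c
/* Simplified stand-ins for the driver states named in the text. */
enum model_state {
    MODEL_OFF,
    MODEL_ON,
    MODEL_RESTARTING,
    MODEL_RESTARTED,
    MODEL_WEDGED,
};

struct model_dev {
    enum model_state state;
    int detach_calls;   /* how many times tx state was torn down */
};

/* Stand-in for ath10k_halt(): every call tears down tx state again. */
static void model_halt(struct model_dev *d)
{
    d->detach_calls++;
}

/* Stand-in for the restart worker, run once per firmware crash. */
static void model_restart_work(struct model_dev *d)
{
    switch (d->state) {
    case MODEL_ON:
        model_halt(d);                /* first teardown */
        d->state = MODEL_RESTARTING;  /* mac80211 restart is now queued */
        break;
    case MODEL_RESTARTING:
        d->state = MODEL_WEDGED;      /* crash within a crash */
        break;
    default:
        break;
    }
}

/* Stand-in for ath10k_stop(): halts only in ON, RESTARTED or WEDGED. */
static void model_stop(struct model_dev *d)
{
    if (d->state == MODEL_ON || d->state == MODEL_RESTARTED ||
        d->state == MODEL_WEDGED)
        model_halt(d);
    d->state = MODEL_OFF;
}

/* One crash, then drv_stop(): RESTARTING is skipped by model_stop(). */
int model_single_crash_detach_count(void)
{
    struct model_dev d = { MODEL_ON, 0 };

    model_restart_work(&d);  /* ON -> RESTARTING, teardown #1 */
    model_stop(&d);          /* RESTARTING fails the check, no teardown */
    return d.detach_calls;
}

/* Two crashes, then drv_stop(): WEDGED lets the second teardown run. */
int model_double_crash_detach_count(void)
{
    struct model_dev d = { MODEL_ON, 0 };

    model_restart_work(&d);  /* ON -> RESTARTING, teardown #1 */
    model_restart_work(&d);  /* RESTARTING -> WEDGED, no teardown */
    model_stop(&d);          /* WEDGED passes the check, teardown #2 */
    return d.detach_calls;
}
```

In this model the single-crash path tears down once and the
double-crash path tears down twice, which is the second
ath10k_htt_tx_detach() call suspected above.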

The following commit (it has been upstream for some time now) should
fix the problem:

commit c5058f5b82f226b236dc5a65015152ed3c23efff
Author: Michal Kazior <michal.kazior at tieto.com>
Date:   Mon May 26 12:46:03 2014 +0300

    ath10k: perform hw restart lazily

    This reduces risk of races and prepares for more
    hw restart fixes.

    It also makes sense to perform teardown after
    mac80211 starts its restart routine as it
    guarantees it has stopped itself by then
    (including tx queues).

    Signed-off-by: Michal Kazior <michal.kazior at tieto.com>
    Signed-off-by: Kalle Valo <kvalo at qca.qualcomm.com>

This probably makes your ieee80211_stop_queues() in ath10k_halt() obsolete too.

Michał