Deadlock on (faked) firmware crash, CUS239, modified 10.4.3 firmware.

Ben Greear greearb at candelatech.com
Thu Mar 31 12:16:49 PDT 2016


On 03/30/2016 11:32 PM, Michal Kazior wrote:
> On 31 March 2016 at 00:28, Ben Greear <greearb at candelatech.com> wrote:
>>
>>> Hmm.. If it still reproduces can you try the following diff?
>>>
>>> --- a/drivers/net/wireless/ath/ath10k/mac.c
>>> +++ b/drivers/net/wireless/ath/ath10k/mac.c
>>> @@ -3780,6 +3780,8 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
>>>                   list_del_init(&artxq->list);
>>>                   if (ret != -ENOENT)
>>>                           list_add_tail(&artxq->list, &ar->txqs);
>>> +               else if (artxq == last)
>>> +                       last = list_last_entry(&ar->txqs, struct ath10k_txq, list);
>>>
>>>                   ath10k_htt_tx_txq_update(hw, txq);
>>
>>
>> Ok, I added this code, and can still reproduce the problem.
>>
>> Firmware is crashing multiple times a minute on this machine in its
>> current configuration.  Right before it hung, firmware crashed and
>> was restarted, and then I got the hang notification.
>>
>> I don't see any obvious bail-out in the tx_push_pending logic
>> if the firmware crashes?
>
> There's no explicit bail-out, yes. It should bail out if
> ath10k_mac_tx_push_txq() fails though (except -ENOENT, which is
> treated slightly differently but should result in bail-out eventually
> as well as ar->txqs will drain until it's empty).
>
> HTT-tx doesn't check for FW crash but it should be ultimately limited
> by either CE ring size and HTT's num-pending-tx (both should not be
> replenished as FW crashed and interrupts should not come in anymore).
> Whichever the case, a <0 retval should result in a bailout.

I tried adding a check for FW crash yesterday, but that did not help.

Today, I added a limit of 2000 loops.  I see that limit hit, and then the
kernel crashes.  Maybe my patch is wrong.

I've tried to apply (almost) every patch in linux.ath related to ath10k,
including a few from the mailing list that have not been applied yet.

My push-pending method now looks like this:

void ath10k_mac_tx_push_pending(struct ath10k *ar)
{
	struct ieee80211_hw *hw = ar->hw;
	struct ieee80211_txq *txq;
	struct ath10k_txq *artxq;
	struct ath10k_txq *last;
	int ret;
	int max;
	int loop_max = 2000;

	spin_lock_bh(&ar->txqs_lock);
	rcu_read_lock();

	last = list_last_entry(&ar->txqs, struct ath10k_txq, list);
	while (!list_empty(&ar->txqs)) {
		artxq = list_first_entry(&ar->txqs, struct ath10k_txq, list);
		txq = container_of((void *)artxq, struct ieee80211_txq,
				   drv_priv);

		if (--loop_max == 0) {
			ath10k_err(ar, "Looped 2000 times in tx_push_pending, bailing out.\n");
			break;
		}
		
		/* Prevent aggressive sta/tid taking over tx queue */
		max = 16;
		ret = 0;
		while (ath10k_mac_tx_can_push(hw, txq) && max--) {
			ret = ath10k_mac_tx_push_txq(hw, txq);
			if (ret < 0)
				break;
		}

		list_del_init(&artxq->list);
		if (ret != -ENOENT)
			list_add_tail(&artxq->list, &ar->txqs);
		else if (artxq == last)
			last = list_last_entry(&ar->txqs, struct ath10k_txq, list);

		ath10k_htt_tx_txq_update(hw, txq);

		if (artxq == last || (ret < 0 && ret != -ENOENT))
			break;
	}

	rcu_read_unlock();
	spin_unlock_bh(&ar->txqs_lock);
}

The crash I get is this:


ath10k_pci 0000:05:00.0: firmware crashed! (uuid 2a118708-977d-43d6-8d40-079ddec99eb3)
ath10k_pci 0000:05:00.0: firmware register dump:
ath10k_pci 0000:05:00.0: [00]: 0x00000009 0x000015B3 0x0099E4B6 0x00955B31
ath10k_pci 0000:05:00.0: [04]: 0x0099E4B6 0x00060130 0x00000005 0x00000016
ath10k_pci 0000:05:00.0: [08]: 0x00455030 0x004402B0 0x004060F0 0x00000007
ath10k_pci 0000:05:00.0: [12]: 0x00000009 0x00000000 0x009533D0 0x009533DF
ath10k_pci 0000:05:00.0: [16]: 0x00953438 0x0A00286E 0x009406B6 0x00000000
ath10k_pci 0000:05:00.0: [20]: 0x4099E4B6 0x00405FEC 0x000000BE 0x00955A00
ath10k_pci 0000:05:00.0: [24]: 0x8099E680 0x0040604C 0x00000000 0xC099E4B6
ath10k_pci 0000:05:00.0: [28]: 0x80986D5F 0x004060AC 0x00423A14 0x004060F0
ath10k_pci 0000:05:00.0: [32]: 0x80984E51 0x004060CC 0x00423A14 0x004060F0
ath10k_pci 0000:05:00.0: [36]: 0x80985CBF 0x004060EC 0x00424654 0x004402B0
ath10k_pci 0000:05:00.0: [40]: 0x809CAE6A 0x0040615C 0x004402B0 0x00424654
ath10k_pci 0000:05:00.0: [44]: 0x80984EBC 0x0040618C 0x004402B0 0x0040623C
ath10k_pci 0000:05:00.0: [48]: 0x809CB3CC 0x0040623C 0x004402B0 0x00411988
ath10k_pci 0000:05:00.0: [52]: 0x80984DE0 0x0040626C 0x00424654 0x004402B0
ath10k_pci 0000:05:00.0: [56]: 0x809CCE08 0x0040635C 0x00424654 0x00423234
ath10k_pci 0000:05:00.0: ath10k_pci ATH10K_DBG_BUFFER:
ath10k: [0000]: 0001854A 17FC4C01 71108880 00050000 00C400BF 000000FF FBFFFFFF 0001854E
ath10k: [0008]: 07FC4C02 00000004 0001854F 0060581D 0001854F 17FC4C01 0F00851C 0000000A
ath10k: [0016]: 06003007 0000FFAA FFFFFFFF 0001854F 17FC4C01 71108880 00000000 00C400BF
ath10k: [0024]: 00000000 00000FF0 0001854F 17FC4C01 71108880 00010000 00C400BF 00000000
ath10k: [0032]: FFFFFFFF 0001854F 17FC4C01 71108880 00020000 00C400BF 00000000 FFFFFFFF
ath10k: [0040]: 0001854F 17FC4C01 71108880 00030000 00C400BF 000000FF FFFFFFFF 0001854F
ath10k: [0048]: 17FC4C01 71108880 00040000 00C400BF 000000FF FFFFFFFF 0001854F 17FC4C01
ath10k: [0056]: 71108880 00050000 00C400BF 000000FF FBFFFFFF 00018550 0060581D 00018550
ath10k: [0064]: 0860581B 0000851C 00000000 00018550 0060581D 00018550 07FC4C02 00000004
ath10k: [0072]: 00018551 0060581D 00018551 17FC4C01 0F00851C 0000000A 06003007 0000FFAA
ath10k: [0080]: FFFFFFFF 00018551 17FC4C01 71108880 00000000 00C400BF 00000000 00000FF0
ath10k: [0088]: 00018551 17FC4C01 71108880 00010000 00C400BF 00000000 FFFFFFFF 00018551
ath10k: [0096]: 17FC4C01 71108880 00020000 00C400BF 00000000 FFFFFFFF 00018551 17FC4C01
ath10k: [0104]: 71108880 00030000 00C400BF 000000FF FFFFFFFF 00018551 17FC4C01 71108880
ath10k: [0112]: 00040000 00C400BF 000000FF FFFFFFFF 00018551 17FC4C01 71108880 00050000
ath10k: [0120]: 00C400BF 000000FF FBFFFFFF 00018551 14605853 51100001 000F0DE4 00000400
ath10k: [0128]: 00000056 00440380 00018551 0060581D 00018551 0460581C 00000001 00018551
ath10k: [0136]: 0060581D 00018551 07FC4C02 00000004 00018552 0060581D 00018552 17FC4C01
ath10k: [0144]: 0F00851C 0000000A 06003007 0000FFAA FFFFFFFF 00018553 17FC4C01 71108880
ath10k: [0152]: 00000000 00C400BF 00000000 00000FF0 00018553 17FC4C01 71108880 00010000
ath10k: [0160]: 00C400BF 00000000 FFFFFFFF 00018553 17FC4C01 71108880 00020000 00C400BF
ath10k: [0168]: 00000000 FFFFFFFF 00018553 17FC4C01 71108880 00030000 00C400BF 000000FF
ath10k: [0176]: FFFFFFFF 00018553 17FC4C01 71108880 00040000 00C400BF 000000FF FFFFFFFF
ath10k: [0184]: 00018553 17FC4C01 71108880 00050000 00C400BF 000000FF FBFFFFFF 00018553
ath10k: [0192]: 07FC4C02 00000001 00018553 07FC4C02 00000001 00018553 0BFC5826 000005E9
ath10k: [0200]: 00000003 00018554 0BFC5822 0000C01D 00000406 00018578 08383812 000F45C4
ath10k: [0208]: 00424654 00018578 10383809 0000143C 00000001 00000000 00000000 0001857B
ath10k: [0216]: 14385853 51100001 000F0D9C 000003FC 00000057 004402B0 0001857B 14385853
ath10k: [0224]: 51100001 000F0D54 000003FE 00000058 004402B0 0001857B 07FC5830 00000008
ath10k: [0232]: 0001857B 14385854 51100002 000F0D54 00000061 00000057 004402B0 0001857B
ath10k: [0240]: 14385851 91107001 00424654 004402B0 00000008 00000006 0001857B 17FC5855
ath10k: [0248]: 91108001 00000000 00000000 00000007 000000BE 0001857B 0FFC5855 91108002
ath10k: [0256]: 004402B0 00000010 0001857B 17FC0001 0099E4B6 000015B3 000015B3 00405EDC
ath10k: [0264]: 00000009
ath10k_pci 0000:05:00.0: ATH10K_END
sta13: drv-set-bitrate-mask had error return: -108
rdev-set-bitrate-mask failed: -108
wlan3: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting
ath10k_pci 0000:05:00.0: Looped 2000 times in tx_push_pending, bailing out.
sta22: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting
sta0: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting
sta1: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting
sta2: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting
ath10k_pci 0000:05:00.0: Looped 2000 times in tx_push_pending, bailing out.
sta3: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting
BUG: unable to handle kernel paging request at 0000000000001000
IP: [<ffffffffa08e9810>] __skb_dequeue+0x2e/0x37 [mac80211]
PGD 0
Oops: 0002 [#1] PREEMPT SMP
Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 8021q garp mrp stp llc bnep bluetooth fuse macvlan wanlink(O) pktgen 
rpcsec_gss_krb5 nfsv4 nfs fscache iTCO_wdt iTCO_vendor_support ath9k ath10k_pci coretemp ath9k_common hwmon intel_rapl ath10k_core iosf_mbi ath9k_hw 
x86_pkg_temp_thermal intel_powerclamp kvm_intel ath joydev kvm mac80211 irqbypass serio_raw pcspkr cfg80211 i2c_i801 lpc_ich snd_hda_codec_hdmi 
snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm 8250_fintek snd_timer snd shpchp 
soundcore tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc ata_generic i915 pata_acpi i2c_algo_bit drm_kms_helper e1000e ptp pps_core drm i2c_core video 
fjes ipv6 [last unloaded: nf_conntrack]
CPU: 2 PID: 581 Comm: kworker/u8:4 Tainted: G        W  O    4.4.6+ #21
Hardware name: To be filled by O.E.M. To be filled by O.E.M./HURONRIVER, BIOS 4.6.5 05/02/2012
Workqueue: phy2 ieee80211_iface_work [mac80211]
task: ffff8800d9c90000 ti: ffff880213fd0000 task.ti: ffff880213fd0000
RIP: 0010:[<ffffffffa08e9810>]  [<ffffffffa08e9810>] __skb_dequeue+0x2e/0x37 [mac80211]
RSP: 0018:ffff88021eb03c28  EFLAGS: 00010296
RAX: ffff8800cbfd7000 RBX: ffff8800cbfd5060 RCX: ffff8800cbfd1000
RDX: 0000000000001000 RSI: 00000000d9c90805 RDI: ffff8800cbfd5000
RBP: ffff88021eb03c28 R08: 0000000000000001 R09: 0000000000000000
R10: ffff88021eb03ba8 R11: ffff8800cbfd5030 R12: ffff8800cbfd5060
R13: ffff880214a34902 R14: ffff8800cbfd5018 R15: ffff88021350e1b0
FS:  0000000000000000(0000) GS:ffff88021eb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000001000 CR3: 0000000001c0a000 CR4: 00000000000406e0
Stack:
  ffff88021eb03c68 ffffffffa08e985a ffff880214a30a60 ffff880214a35600
  ffff8800cbfd5060 ffff880214a349e0 ffff880214a35430 ffff88021350e1b0
  ffff88021eb03cb8 ffffffffa0ec2bb4 ffff880214a30a60 0000000014a30a60
Call Trace:
  <IRQ>
  [<ffffffffa08e985a>] ieee80211_tx_dequeue+0x41/0xfe [mac80211]
  [<ffffffffa0ec2bb4>] ath10k_mac_tx_push_txq+0x6a/0x13b [ath10k_core]
  [<ffffffffa0ec2ddb>] ath10k_mac_tx_push_pending+0x156/0x16b [ath10k_core]
  [<ffffffffa0ed123d>] ath10k_htt_t2h_msg_handler+0x7d9/0x886 [ath10k_core]
  [<ffffffff816f9f9a>] ? _raw_spin_unlock_bh+0x30/0x33
  [<ffffffffa0fca532>] ? ath10k_pci_hif_send_complete_check+0x5d/0x5d [ath10k_pci]
  [<ffffffffa0fca557>] ath10k_pci_htt_rx_deliver+0x25/0x2a [ath10k_pci]
  [<ffffffffa0fcbb51>] ath10k_pci_process_rx_cb+0x191/0x1c9 [ath10k_pci]
  [<ffffffff810f23ad>] ? __local_bh_enable_ip+0xa4/0xb9
  [<ffffffff816f9f9a>] ? _raw_spin_unlock_bh+0x30/0x33
  [<ffffffffa0fcbbbf>] ath10k_pci_htt_rx_cb+0x24/0x27 [ath10k_pci]
  [<ffffffffa0fce1be>] ath10k_ce_per_engine_service+0x64/0xa0 [ath10k_pci]
  [<ffffffffa0fce260>] ath10k_ce_per_engine_service_any+0x66/0x74 [ath10k_pci]
  [<ffffffffa0fcc4b3>] ath10k_pci_tasklet+0x3a/0x4e [ath10k_pci]
  [<ffffffff810f29e0>] tasklet_action+0xc0/0xcf
  [<ffffffff810f1ff6>] __do_softirq+0x1a4/0x407
  [<ffffffff810f2462>] irq_exit+0x40/0x94
  [<ffffffff810134a2>] do_IRQ+0xd5/0xed
  [<ffffffff816fb24c>] common_interrupt+0x8c/0x8c
  <EOI>
  [<ffffffff81129d49>] ? arch_local_irq_restore+0x6/0xd
  [<ffffffff816f8a3a>] __mutex_unlock_slowpath+0x120/0x137
  [<ffffffff816f8a5a>] mutex_unlock+0x9/0xb
  [<ffffffffa0ebcc38>] ath10k_conf_tx+0x3a9/0x3bb [ath10k_core]
  [<ffffffffa08c2b48>] drv_conf_tx+0x140/0x202 [mac80211]
  [<ffffffffa08f3072>] ieee80211_set_wmm_default+0x1fb/0x24a [mac80211]
  [<ffffffffa0908bc5>] ieee80211_set_disassoc+0x248/0x31f [mac80211]
  [<ffffffffa0908ccf>] ieee80211_sta_connection_lost+0x33/0x69 [mac80211]
  [<ffffffffa090bb8f>] ieee80211_sta_work+0x5fc/0xda9 [mac80211]
  [<ffffffff8112d30b>] ? mark_held_locks+0x5e/0x74
  [<ffffffff8112d490>] ? trace_hardirqs_on_caller+0x16f/0x18b
  [<ffffffff816fa024>] ? _raw_spin_unlock_irqrestore+0x48/0x5d
  [<ffffffffa08d54bd>] ieee80211_iface_work+0x335/0x34e [mac80211]
  [<ffffffff8110471a>] process_one_work+0x260/0x4db
  [<ffffffff81104e50>] worker_thread+0x1e9/0x29b
  [<ffffffff81104c67>] ? rescuer_thread+0x2a8/0x2a8
  [<ffffffff81104c67>] ? rescuer_thread+0x2a8/0x2a8
  [<ffffffff81109bfb>] kthread+0xcf/0xd7
  [<ffffffff81109b2c>] ? kthread_parkme+0x1f/0x1f
  [<ffffffff816faaef>] ret_from_fork+0x3f/0x70
  [<ffffffff81109b2c>] ? kthread_parkme+0x1f/0x1f
Code: 55 48 89 e5 48 39 c7 74 27 48 85 c0 74 24 ff 4f 10 48 8b 08 48 8b 50 08 48 c7 00 00 00 00 00 48 c7 40 08 00 00 00 00 48 89 51 08 <48> 89 0a eb 02 31 c0 5d 
c3 55 48 89 e5 41 57 41 56 4c 8d 76 b8
RIP  [<ffffffffa08e9810>] __skb_dequeue+0x2e/0x37 [mac80211]
  RSP <ffff88021eb03c28>
CR2: 0000000000001000
---[ end trace eb4cdb33d766b5f3 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
Rebooting in 10 seconds..

Thanks,
Ben

>
>
> Michał
>


-- 
Ben Greear <greearb at candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



