[PATCH] ath11k: enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS for WCN6855

Wen Gong quic_wgong at quicinc.com
Mon Dec 6 20:35:04 PST 2021


On 12/7/2021 4:03 AM, Sven Eckelmann wrote:
> On Monday, 6 December 2021 08:10:40 CET Wen Gong wrote:
>>> On Monday, 6 December 2021 04:29:39 CET Wen Gong wrote:
>>> [...]
>>>> I did test in my setup, not see the crash.
>>>>
>>>> I am afraid you also need this patch("ath11k: change to use dynamic
>>>> memory for channel list of scan",
>>>>
>>>> https://patchwork.kernel.org/project/linux-wireless/patch/20211129110939.15711-1-quic_wgong@quicinc.com
>>>> )
>>>>
>>>> Could you apply this patch and try again?
>>> Tried it and I see the same problem.
>> Could you tell what is your test steps?
> Start kernel with commit a93789ae541c ("ath11k: Avoid NULL ptr
> access during mgmt tx cleanup") + patches:
>
> * ath11k: enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS for WCN6855
> * ath11k: change to use dynamic memory for channel list of scan
>
> You can find the config in the first mail. But I have now enabled KASAN inline
> to hopefully create some better error messages.
>
> The firmware + board data (see mail "ath11k: incorrect board_id retrieval")
> was prepared like this:
>
>     git clone https://github.com/kvalo/ath11k-firmware /root/ath11k-firmware
>     mkdir -p /lib/firmware/ath11k/WCN6855/hw2.0/
>     cp /root/ath11k-firmware/WCN6855/hw2.0/*.bin /lib/firmware/ath11k/WCN6855/hw2.0/
>     cp /root/ath11k-firmware/WCN6855/hw2.0/1.1/WLAN.HSP.1.1-02892.1-QCAHSPSWPL_V1_V2_SILICONZ_LITE-1/*.bin /lib/firmware/ath11k/WCN6855/hw2.0/
>
>     git clone https://github.com/qca/qca-swiss-army-knife /root/qca-swiss-army-knife
>     apt install python2
>     python2 /root/qca-swiss-army-knife/tools/scripts/ath11k/ath11k-bdencoder  -e /lib/firmware/ath11k/WCN6855/hw2.0/board-2.bin
>     rm /lib/firmware/ath11k/WCN6855/hw2.0/board-2.bin
>     cp 'bus=pci,vendor=17cb,device=1103,subsystem-vendor=17cb,subsystem-device=3374,qmi-chip-id=2,qmi-board-id=266.bin' /lib/firmware/ath11k/WCN6855/hw2.0/board.bin
>
> Then I am just starting up the device as usual, and start wpa_supplicant (with
> defconfig + CONFIG_MESH=y) from commit 14ab4a816c68 ("Reject
> ap_vendor_elements if its length is odd")
>
>      cat << "EOF" > station_test.cfg
>      network={
>        ssid="MyTestAP"
>        key_mgmt=WPA-PSK FT-PSK
>        proto=RSN
>        psk="testtest"
>      }
>      EOF
>      ip link set up dev wlp6s0
>      ~/hostap/wpa_supplicant/wpa_supplicant -D nl80211 -i wlp6s0 -c station_test.cfg
>
> The actual SSID + PSK is valid and multiple access points (4) have this BSS on
> 2.4GHz + 5GHz.
>
> So you are basically always calling dev_kfree_skb_any in ath11k_ce_tx_process_cb
> because wcn6855 hw2.0 has credit_flow has set. But it seems like one of the
> entries returned by ath11k_ce_completed_send_next is bogus and causes this
> problems during the ath11k_ce_tx_process_cb. And for some reason, this is
> triggered here by this firmware feature.
>
>      ./scripts/faddr2line --list vmlinux consume_skb+0x9f/0x1c0
>      consume_skb+0x9f/0x1c0:
>      
>      __kfree_skb at net/core/skbuff.c:757
>       752     */
>       753
>       754    void __kfree_skb(struct sk_buff *skb)
>       755    {
>       756            skb_release_all(skb);
>      >757<           kfree_skbmem(skb);
>       758    }
>       759    EXPORT_SYMBOL(__kfree_skb);
>       760
>       761    /**
>       762     *      kfree_skb - free an sk_buff
>      
>      (inlined by) consume_skb at net/core/skbuff.c:912
>       907    {
>       908            if (!skb_unref(skb))
>       909                    return;
>       910
>       911            trace_consume_skb(skb);
>      >912<           __kfree_skb(skb);
>       913    }
>       914    EXPORT_SYMBOL(consume_skb);
>       915    #endif
>       916
>       917    /**
>      
>      (inlined by) consume_skb at net/core/skbuff.c:906
>       901     *
>       902     *      Drop a ref to the buffer and free it if the usage count has hit zero
>       903     *      Functions identically to kfree_skb, but kfree_skb assumes that the frame
>       904     *      is being dropped after a failure and notes that
>       905     */
>      >906<   void consume_skb(struct sk_buff *skb)
>       907    {
>       908            if (!skb_unref(skb))
>       909                    return;
>       910
>       911            trace_consume_skb(skb);
>
>
>      ./scripts/faddr2line --list vmlinux skb_release_data+0x1b0/0x5c0
>      skb_release_data+0x1b0/0x5c0:
>      
>      skb_zcopy_clear at include/linux/skbuff.h:1549
>       1544   {
>       1545           struct ubuf_info *uarg = skb_zcopy(skb);
>       1546
>       1547           if (uarg) {
>       1548                   if (!skb_zcopy_is_nouarg(skb))
>      >1549<                          uarg->callback(skb, uarg, zerocopy_success);
>       1550
>       1551                   skb_shinfo(skb)->flags &= ~SKBFL_ALL_ZEROCOPY;
>       1552           }
>       1553   }
>       1554
>      
>      (inlined by) skb_release_data at net/core/skbuff.c:669
>       664            if (skb->cloned &&
>       665                atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
>       666                                  &shinfo->dataref))
>       667                    goto exit;
>       668
>      >669<           skb_zcopy_clear(skb, true);
>       670
>       671            for (i = 0; i < shinfo->nr_frags; i++)
>       672                    __skb_frag_unref(&shinfo->frags[i], skb->pp_recycle);
>       673
>       674            if (shinfo->frag_list)
>
> But I didn't like the inlined code. So I've changed the compilation flags
> slightly:
>
>      diff --git a/net/core/Makefile b/net/core/Makefile
>      index 6bdcb2cafed8..5eda226c5f27 100644
>      --- a/net/core/Makefile
>      +++ b/net/core/Makefile
>      @@ -37,3 +37,4 @@ obj-$(CONFIG_NET_SOCK_MSG) += skmsg.o
>       obj-$(CONFIG_BPF_SYSCALL) += sock_map.o
>       obj-$(CONFIG_BPF_SYSCALL) += bpf_sk_storage.o
>       obj-$(CONFIG_OF)	+= of_net.o
>      +ccflags-y += -fno-inline -O1 -fno-optimize-sibling-calls
>
> Now the stacktrace is a lot more readable. And the returned
> crash location makes a lot more sense:
>
>      ./scripts/faddr2line --list vmlinux 'skb_zcopy_clear+0x34/0x8f'
>      skb_zcopy_clear+0x34/0x8f:
>      
>      skb_zcopy_clear at include/linux/skbuff.h:1549
>       1544   {
>       1545           struct ubuf_info *uarg = skb_zcopy(skb);
>       1546
>       1547           if (uarg) {
>       1548                   if (!skb_zcopy_is_nouarg(skb))
>      >1549<                          uarg->callback(skb, uarg, zerocopy_success);
>       1550
>       1551                   skb_shinfo(skb)->flags &= ~SKBFL_ALL_ZEROCOPY;
>       1552           }
>       1553   }
>       1554
>
> Or with the assembler:
>
>       (gdb) disassemble /m *(skb_zcopy_clear+0x34/0x8f)
>       Dump of assembler code for function skb_zcopy_clear:
>       1544    {
>          0x000000000000072a <+0>:     push   %r12
>          0x000000000000072c <+2>:     push   %rbp
>          0x000000000000072d <+3>:     push   %rbx
>          0x000000000000072e <+4>:     mov    %rdi,%rbx
>          0x0000000000000731 <+7>:     mov    %esi,%r12d
>       
>       1545            struct ubuf_info *uarg = skb_zcopy(skb);
>          0x0000000000000734 <+10>:    call   0x5d3 <skb_zcopy>
>       
>       1546
>       1547            if (uarg) {
>          0x0000000000000739 <+15>:    test   %rax,%rax
>          0x000000000000073c <+18>:    je     0x7a0 <skb_zcopy_clear+118>
>          0x000000000000073e <+20>:    mov    %rax,%rbp
>       
>       1548                    if (!skb_zcopy_is_nouarg(skb))
>          0x0000000000000741 <+23>:    mov    %rbx,%rdi
>          0x0000000000000744 <+26>:    call   0x6f6 <skb_zcopy_is_nouarg>
>          0x0000000000000749 <+31>:    test   %al,%al
>          0x000000000000074b <+33>:    jne    0x777 <skb_zcopy_clear+77>
>       
>       1549                            uarg->callback(skb, uarg, zerocopy_success);
>          0x000000000000074d <+35>:    mov    %rbp,%rdx
>          0x0000000000000750 <+38>:    shr    $0x3,%rdx
>          0x0000000000000754 <+42>:    movabs $0xdffffc0000000000,%rax
>          0x000000000000075e <+52>:    cmpb   $0x0,(%rdx,%rax,1)
>          0x0000000000000762 <+56>:    jne    0x7a5 <skb_zcopy_clear+123>
>          0x0000000000000764 <+58>:    movzbl %r12b,%edx
>          0x0000000000000768 <+62>:    mov    0x0(%rbp),%rax
>          0x000000000000076c <+66>:    mov    %rbp,%rsi
>          0x000000000000076f <+69>:    mov    %rbx,%rdi
>          0x0000000000000772 <+72>:    call   0x777 <skb_zcopy_clear+77>
>          0x00000000000007a5 <+123>:   mov    %rbp,%rdi
>          0x00000000000007a8 <+126>:   call   0x7ad <skb_zcopy_clear+131>
>          0x00000000000007ad <+131>:   jmp    0x764 <skb_zcopy_clear+58>
>       
>       1550
>       1551                    skb_shinfo(skb)->flags &= ~SKBFL_ALL_ZEROCOPY;
>          0x0000000000000777 <+77>:    mov    %rbx,%rdi
>          0x000000000000077a <+80>:    call   0x518 <skb_end_pointer>
>          0x000000000000077f <+85>:    mov    %rax,%rbx
>          0x0000000000000782 <+88>:    mov    %rax,%rdx
>          0x0000000000000785 <+91>:    shr    $0x3,%rdx
>          0x0000000000000789 <+95>:    movabs $0xdffffc0000000000,%rax
>          0x0000000000000793 <+105>:   movzbl (%rdx,%rax,1),%eax
>          0x0000000000000797 <+109>:   test   %al,%al
>          0x0000000000000799 <+111>:   je     0x79d <skb_zcopy_clear+115>
>          0x000000000000079b <+113>:   jle    0x7af <skb_zcopy_clear+133>
>          0x000000000000079d <+115>:   andb   $0xf8,(%rbx)
>          0x00000000000007af <+133>:   mov    %rbx,%rdi
>          0x00000000000007b2 <+136>:   call   0x7b7 <skb_zcopy_clear+141>
>          0x00000000000007b7 <+141>:   jmp    0x79d <skb_zcopy_clear+115>
>       
>       1552            }
>       1553    }
>          0x00000000000007a0 <+118>:   pop    %rbx
>          0x00000000000007a1 <+119>:   pop    %rbp
>          0x00000000000007a2 <+120>:   pop    %r12
>          0x00000000000007a4 <+122>:   ret
>       
>       End of assembler dump.
>
> To make it even easier to read, just disable the inline KASAN and reduce the
> optimization level for this for it:
>
>      diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>      index 059b6266dcd7..819cc58ab051 100644
>      --- a/include/linux/skbuff.h
>      +++ b/include/linux/skbuff.h
>      @@ -1540,6 +1540,8 @@ static inline void net_zcopy_put_abort(struct ubuf_info *uarg, bool have_uref)
>       }
>       
>       /* Release a reference on a zerocopy structure */
>      +#pragma GCC push_options
>      +#pragma GCC optimize ("O0")
>       static inline void skb_zcopy_clear(struct sk_buff *skb, bool zerocopy_success)
>       {
>       	struct ubuf_info *uarg = skb_zcopy(skb);
>      @@ -1551,6 +1553,7 @@ static inline void skb_zcopy_clear(struct sk_buff *skb, bool zerocopy_success)
>       		skb_shinfo(skb)->flags &= ~SKBFL_ALL_ZEROCOPY;
>       	}
>       }
>      +#pragma GCC pop_options
>       
>       static inline void skb_mark_not_on_list(struct sk_buff *skb)
>       {
>
> This creates this nice, unoptimized function which crashes at +63:
>
>      $ gdb net/core/skbuff.o -q
>      Reading symbols from net/core/skbuff.o...
>      (gdb) disassemble /m *(skb_zcopy_clear+0x3f/0x70)
>      Dump of assembler code for function skb_zcopy_clear:
>      1546    {
>         0x0000000000000000 <+0>:     push   %rbp
>         0x0000000000000001 <+1>:     mov    %rsp,%rbp
>         0x0000000000000004 <+4>:     sub    $0x18,%rsp
>         0x0000000000000008 <+8>:     mov    %rdi,-0x10(%rbp)
>         0x000000000000000c <+12>:    mov    %esi,%eax
>         0x000000000000000e <+14>:    mov    %al,-0x14(%rbp)
>      
>      1547            struct ubuf_info *uarg = skb_zcopy(skb);
>         0x0000000000000011 <+17>:    mov    -0x10(%rbp),%rax
>         0x0000000000000015 <+21>:    mov    %rax,%rdi
>         0x0000000000000018 <+24>:    call   0x29e <skb_zcopy>
>         0x000000000000001d <+29>:    mov    %rax,-0x8(%rbp)
>      
>      1548
>      1549            if (uarg) {
>         0x0000000000000021 <+33>:    cmpq   $0x0,-0x8(%rbp)
>         0x0000000000000026 <+38>:    je     0x6d <skb_zcopy_clear+109>
>      
>      1550                    if (!skb_zcopy_is_nouarg(skb))
>         0x0000000000000028 <+40>:    mov    -0x10(%rbp),%rax
>         0x000000000000002c <+44>:    mov    %rax,%rdi
>         0x000000000000002f <+47>:    call   0x2df <skb_zcopy_is_nouarg>
>         0x0000000000000034 <+52>:    xor    $0x1,%eax
>         0x0000000000000037 <+55>:    test   %al,%al
>         0x0000000000000039 <+57>:    je     0x59 <skb_zcopy_clear+89>
>      
>      1551                            uarg->callback(skb, uarg, zerocopy_success);
>         0x000000000000003b <+59>:    mov    -0x8(%rbp),%rax
>         0x000000000000003f <+63>:    mov    (%rax),%r8
>         0x0000000000000042 <+66>:    movzbl -0x14(%rbp),%edx
>         0x0000000000000046 <+70>:    mov    -0x8(%rbp),%rcx
>         0x000000000000004a <+74>:    mov    -0x10(%rbp),%rax
>         0x000000000000004e <+78>:    mov    %rcx,%rsi
>         0x0000000000000051 <+81>:    mov    %rax,%rdi
>         0x0000000000000054 <+84>:    call   0x59 <skb_zcopy_clear+89>
>      
>      1552
>      1553                    skb_shinfo(skb)->flags &= ~SKBFL_ALL_ZEROCOPY;
>         0x0000000000000059 <+89>:    mov    -0x10(%rbp),%rax
>         0x000000000000005d <+93>:    mov    %rax,%rdi
>         0x0000000000000060 <+96>:    call   0x27f <skb_end_pointer>
>         0x0000000000000065 <+101>:   movzbl (%rax),%edx
>         0x0000000000000068 <+104>:   and    $0xfffffff8,%edx
>         0x000000000000006b <+107>:   mov    %dl,(%rax)
>      
>      1554            }
>      1555    }
>         0x000000000000006d <+109>:   nop
>         0x000000000000006e <+110>:   leave
>         0x000000000000006f <+111>:   ret
>      
>      End of assembler dump.
>
> The question now: What is causing the unclean state of the skb and thus
> doesn't let it get rejected by skb_zcopy_is_nouarg before the uarg
> callback is tried.
>
> Kind regards,
> 	Sven

Thanks Sven a lot for your analyze.

I still can not reproduce it.

I think it is because the write over skb->tail in scan, because the 
invalid address

is same for each crash(0x408210000b231a/0xe0080c4200016463), and it is 
caused by this instruction

"0x000000000000003f <+63>:    mov    (%rax),%r8" which is assign the value of uarg->callback to %r8.

Could you add below change?

It will print the log to help us find out the bug.

diff --git a/drivers/net/wireless/ath/ath11k/mac.c 
b/drivers/net/wireless/ath/ath11k/mac.c
index 26181f237e23..2147f74f5ebf 100644
--- a/drivers/net/wireless/ath/ath11k/mac.c
+++ b/drivers/net/wireless/ath/ath11k/mac.c
@@ -3421,12 +3421,15 @@ static int ath11k_mac_op_hw_scan(struct 
ieee80211_hw *hw,
                 memcpy(arg.extraie.ptr, req->ie, req->ie_len);
         }

+       ath11k_info(ar->ab, "n_ssids %d\n", req->n_ssids);
+
         if (req->n_ssids) {
                 arg.num_ssids = req->n_ssids;
                 for (i = 0; i < arg.num_ssids; i++) {
                         arg.ssid[i].length  = req->ssids[i].ssid_len;
                         memcpy(&arg.ssid[i].ssid, req->ssids[i].ssid,
                                req->ssids[i].ssid_len);
+                       ath11k_info(ar->ab, "ssid[%d] len %d\n", i, 
arg.ssid[i].length);
                 }
         } else {
                 arg.scan_flags |= WMI_SCAN_FLAG_PASSIVE;
diff --git a/drivers/net/wireless/ath/ath11k/wmi.c 
b/drivers/net/wireless/ath/ath11k/wmi.c
index 7d7f76d4bf1f..e42a64251799 100644
--- a/drivers/net/wireless/ath/ath11k/wmi.c
+++ b/drivers/net/wireless/ath/ath11k/wmi.c
@@ -2271,6 +2271,7 @@ int ath11k_wmi_send_scan_start_cmd(struct ath11k *ar,
                 }
         }

+       ath11k_info(ar->ab, "%s ptr %px skb data %px len %d over %d", 
__func__, ptr, skb->data, skb->len, ((unsigned char 
*)ptr)-skb->data-skb->len);
         ret = ath11k_wmi_cmd_send(wmi, skb,
                                   WMI_START_SCAN_CMDID);
         if (ret) {




More information about the ath11k mailing list