ath11k: QCA6390 on Dell XPS 13 and kernel crashes

wi nk wink at technolu.st
Sun Dec 6 20:17:08 EST 2020


On Sun, Dec 6, 2020 at 10:45 PM wi nk <wink at technolu.st> wrote:
>
> On Sun, Dec 6, 2020 at 6:53 PM wi nk <wink at technolu.st> wrote:
> >
> > On Sun, Dec 6, 2020 at 6:39 PM Mitchell Nordine
> > <mail at mitchellnordine.com> wrote:
> > >
> > > I recently tried updating to the latest set of patches on `ath11k-qca6390-bringup`, and as expected the crashing still remains (XPS 13 9310 with the QCA6390). I'm finding it difficult to test any of the other behaviour (like improved suspend, etc) as I'm seeing crashes the vast majority of the time. Normally this occurs when the wifi first attempts to connect to a network. On the rare occasion where it does connect successfully, it appears to run smoothly for a seemingly random amount of time before spontaneously crashing and freezing the system. I haven't managed to identify any particular action that causes this.
> > >
> > > FWIW, I still haven't managed to enable Bluetooth in my kernel yet, so there's very little chance that it's contributing to the issue in my case. I think wi-nk's observation is correct that the Bluetooth effect on the race they observed was just a coincidence.
> > >
> > > Let me know if there is anything else I can test to help, or any particular kinds of debugging output you would like to see and I'll give it a go next time I get the chance to test.
> > >
> > >
> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > On Sunday, December 6, 2020 6:00 PM, <ath11k-request at lists.infradead.org> wrote:
> > >
> > > > Message: 1
> > > > Date: Sat, 5 Dec 2020 20:17:10 +0100
> > > > From: wi nk wink at technolu.st
> > > > To: Kalle Valo kvalo at codeaurora.org
> > > > Cc: Thomas Krause thomaskrause at posteo.de, ath11k at lists.infradead.org
> > > > Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> > > > Message-ID:
> > > > CAHUdJJX6JWbNY+=B2D1fFGZPqzbJSw0V0C2i+bZ=xabE56cv_A at mail.gmail.com
> > > > Content-Type: text/plain; charset="UTF-8"
> > > >
> > > > On Tue, Dec 1, 2020 at 11:17 AM wi nk wink at technolu.st wrote:
> > > >
> > > > > On Mon, Nov 30, 2020 at 6:02 PM wi nk wink at technolu.st wrote:
> > > > >
> > > > > > On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo kvalo at codeaurora.org wrote:
> > > > > >
> > > > > > > Hi Wi and Thomas,
> > > > > > > I'll start a new thread about problems on XPS 13. The information is
> > > > > > > scattered across different threads and it's hard to find everything; it's much
> > > > > > > easier to have everything in one place. So let's continue the discussion
> > > > > > > about the kernel crashes on this thread.
> > > > > > > Here's what I have understood so far:
> > > > > > >
> > > > > > > -   On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> > > > > > >     with 32 MSI vectors.
> > > > > > >
> > > > > > > -   On Dell XPS 13 there's a BIOS bug and kernel prints:
> > > > > > >
> > > > > > >
> > > > > > > [ 0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> > > > > > > BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> > > > > > >
> > > > > > > -   Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> > > > > > >     13. We added a hack to ath11k to make it work with only one vector and after
> > > > > > >     that it's possible to boot the firmware, connect to the AP and use the
> > > > > > >     device for a while.
> > > > > > >
> > > > > > > -   But the problem now is that the kernel is crashing almost immediately
> > > > > > >     and almost every time(?). And these crashes only happen on Dell XPS
> > > > > > >     13, all other systems (including Dell XPS 15) seem to work without
> > > > > > >     issues.
> > > > > > >
> > > > > > >
> > > > > > > Is my understanding correct? Did I miss anything?
> > > > > > > About the symptoms Wi reports:
> > > > > > >
> > > > > > > So up until this point, everything is working without issues.
> > > > > > > Everything seems to spiral out of control a couple of seconds later
> > > > > > > when my system attempts to actually bring up the adapter. In most of
> > > > > > > the crash states I will see this:
> > > > > > > [ 31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> > > > > > > [ 31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> > > > > > > [ 31.391928] wlp85s0: authenticated
> > > > > > > [ 31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> > > > > > > [ 31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea (capab=0x411 status=0 aid=6)
> > > > > > > [ 31.407730] wlp85s0: associated
> > > > > > > [ 31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> > > > > > > And then either somewhere in that pile of messages, or a second or two
> > > > > > > after this my machine will start to stutter as I mentioned before, and
> > > > > > > then it either hangs, or I see this message (I'm truncating the
> > > > > > > timestamp):
> > > > > > > [ 35.xxxx ] sched: RT throttling activated
> > > > > > > After that moment, the machine is unresponsive. Sorry I can't seem to
> > > > > > > extract this data other than screenshots from my phone at the moment,
> > > > > > > you can see the dmesg output from 6 different hangs here:
> > > > > > >
> > > > > > > https://github.com/w1nk/ath11k-debug
> > > > > > >
> > > > > > > -------------------------------------
> > > > > > >
> > > > > > > And Thomas Krause reports:
> > > > > > >
> > > > > > > I can confirm this behavior on my configuration. I managed to log in
> > > > > > > once, select the Wifi and connect to it. Curiously enough, it seemed to
> > > > > > > be stable long enough to enter the Wifi passphrase. After the
> > > > > > > connection was established, the system hung, and on each attempt to
> > > > > > > reboot into the graphical system it would freeze at some point
> > > > > > > (sometimes even before showing the login screen).
> > > > > > >
> > > > > > > -------------------------------------
> > > > > > >
> > > > > > > --
> > > > > > > https://patchwork.kernel.org/project/linux-wireless/list/
> > > > > > > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > > > > >
> > > > > > Hi Kalle,
> > > > > > Again, thanks much for your work. I think you've summarized
> > > > > > everything up until this point. On my XPS 13 9310 the behavior of the
> > > > > > RT throttling still exists for me occasionally on loading the
> > > > > > driver/associating with an AP. The throttling consistently occurs
> > > > > > after a few sets of the MHI debug printing showing the EE entering an
> > > > > > invalid state ( AMSS -> INVALID_EE ). I'm now building the latest tag
> > > > > > to see if there are any differences.
> > > > > > Thanks!
> > > > >
> > > > > Just to follow up, the first boot resulted in the RT throttling
> > > > > message as the adapter was coming up/associating. Shortly after, the
> > > > > firmware crashed; the kernel didn't fully freeze, but I needed to
> > > > > reboot to bring the adapter back.
> > > >
> > > > Kalle -
> > > >
> > > > I've noticed one additional behavior that may give someone with
> > > > familiarity with the QCA hardware a clue. I'm running
> > > > ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310. For
> > > > whatever reason, having the bluetooth subsystem enabled (with a paired
> > > > device) on this dell basically guarantees I'll hit the scheduler
> > > > throttling issue as the ath11k driver is initializing / associating.
> > > > The bluetooth system is using the btqca driver. I don't have any
> > > > useful debugging (I'll gladly collect some if there is a way to do it)
> > > > other than tracking some simple statistics. I booted my system 20
> > > > times, 10 times with bluetooth enabled (and some headphones turned on
> > > > ready to pair), and 10 times without. In both scenarios, I'm booting
> > > > into X and manually modprobing the ath11k driver. The difference is
> > > > that with bluetooth on and by the time I modprobe the driver, the
> > > > headphones are paired and I received the throttling message and
> > > > subsequent freezing 10/10 times. With bluetooth off / my headphones
> > > > not paired, I only saw it 2/10. I know it's not much hard information
> > > > but it's reliably reproducible for me, is there anything useful I can
> > > > collect?
> > > >
> > > >
> > > > -------------------------------------
> > > >
> > > > Message: 2
> > > > Date: Sun, 6 Dec 2020 09:05:57 +0100
> > > > From: wi nk wink at technolu.st
> > > > To: Kalle Valo kvalo at codeaurora.org
> > > > Cc: Thomas Krause thomaskrause at posteo.de, ath11k at lists.infradead.org
> > > > Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> > > > Message-ID:
> > > > CAHUdJJU0ykf96GbaMrhkcPv2xSF62CDPNSNSgtoGP6BtBTAk6Q at mail.gmail.com
> > > > Content-Type: text/plain; charset="UTF-8"
> > > >
> > > > Well unfortunately I think the bluetooth was just a red herring in the
> > > > racing. To chase that, I disabled all bluetooth and was able to get
> > > > into a state where I had 6 failed boots in a row. To further poke
> > > > around, I rebuilt the kernel with localmodconfig to disable building
> > > > big chunks of things. This kernel is way less stable and seems to
> > > > freeze most of the time (but does occasionally remain stable), I'm not
> > > > sure what else got disabled in there, but it seems to have had a
> > > > negative impact on the crash racing.
> > > >
> > > >
> > > > End of ath11k Digest, Vol 7, Issue 5
> > >
> > >
> > >
> > > --
> > > ath11k mailing list
> > > ath11k at lists.infradead.org
> > > http://lists.infradead.org/mailman/listinfo/ath11k
> >
> > Hey Mitchell,
> >
> >    One more thing to try that may help us get a little bit of extra
> > info.  Out of everything I've done, something that has remained
> > consistent is to enable the MHI debugging as Kalle suggested:
> >
> > sudo sh -c "echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control"
> >
> >   Before any crash/spinlock, I see the MHI printing (from
> > drivers/bus/mhi/core/main.c L389) show the EE enter an invalid state
> > and then after a number more iterations through this function, things
> > finally go out of control.  So from
> >
> >         dev_dbg(dev, "local ee:%s device ee:%s dev_state:%s\n",
> >                 TO_MHI_EXEC_STR(mhi_cntrl->ee), TO_MHI_EXEC_STR(ee),
> >                 TO_MHI_STATE_STR(state));
> >
> > I'll see something like this:
> >
> > [  312.xxx] mhi 0000:55:00.0: local ee:AMSS device ee:AMSS dev_state:M2
> > [  313.024033] mhi 0000:55:00.0: local ee:INVALID_EE device ee:INVALID_EE dev_state:SYS_ERR
> >
> > Then after a few of those prints showing SYS_ERR, either a spinlock or
> > a firmware crash.  I'm not sure what causes this ee state to go
> > invalid, but maybe that's worth looking into.  Can you confirm the
> > same behavior?  To see this a little easier, I also run dmesg -wH in
> > two windows, one of them piped through grep -v mhi (to filter out the
> > mhi debugging).
> >
> > Thanks!
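
For anyone collecting the same logs, here is a rough userspace sketch of the check described above: given one dmesg line, report whether it is the MHI state print with the device EE shown as INVALID_EE. The function name is made up for illustration and is not part of the mhi driver; the only thing it relies on is the "device ee:%s" text from the dev_dbg format quoted above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not from the mhi driver): returns 1 if the given
 * dmesg line contains "device ee:" immediately followed by INVALID_EE,
 * and 0 for any other line. */
int mhi_line_is_invalid_ee(const char *line)
{
    static const char key[] = "device ee:";
    static const char bad[] = "INVALID_EE";
    const char *p;

    assert(line != NULL);

    p = strstr(line, key);
    if (p == NULL)
        return 0;               /* not the MHI state print at all */

    p += strlen(key);           /* step over the "device ee:" label */
    return strncmp(p, bad, strlen(bad)) == 0;
}
```

Calling this on each line read from dmesg -w and noting the first hit should give a timestamp for the moment the EE goes invalid, which can then be compared against the RT throttling message.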
>
> So I've managed to stabilise my system now, so either the race is
> gone, or I've done something to win it all the time.  So one of the
> avenues of racing I was chasing at first was in the ath11k driver
> itself.  There are a couple areas where the single/shared IRQ is being
> forcibly toggled in ways that the documentation says are not great
> (and the original patch was trying to avoid).  Fixing those didn't
> seem to have much impact on the stability of things (I've included
> those changes in my patch though).  After the last email I was
> thinking about the MHI side of things a bit more and found a number of
> call sites that my naive grepping had missed that do the same thing,
> but via acquiring a lock at the same time.  I changed all the calls
> to *_lock_irq and *_unlock_irq to the *_lock_irqsave and
> *_unlock_irqrestore variants, which take a flags parameter to save and
> restore the interrupt state.  I've now
> booted and loaded the driver 10+ times without a single freeze or
> crash.  I'm not sure all of those modifications are necessary (ie:
> which things are re-entrant in this single interrupt operating mode vs
> which ones can use the simpler lock/unlock mechanisms), so I could use
> some advice/guidance there.
>
> Mitchell - if you want to grab this patch and try it, let me know how
> it goes and I can clean it up for the mailing list:
> https://github.com/w1nk/ath11k-debug/blob/master/one-irq-manage.patch
> (apply to ath11k-qca6390-bringup-202011301608)
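
To spell out the difference between the two lock flavours mentioned above, here is a toy userspace model. It is not kernel code and every name in it is invented for illustration: a plain boolean stands in for the CPU's interrupt-enable flag, and the flags value is passed by pointer rather than through the kernel's macro. The point it tries to show is that the _irq pair always re-enables interrupts on unlock, while the irqsave/irqrestore pair puts back whatever state the caller had. Whether that is the actual cause of the freezes here is still a guess.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the interrupt-enable flag of a single CPU. */
static bool irqs_enabled = true;

/* Models spin_lock_irq()/spin_unlock_irq(): disable, then always enable. */
static void model_lock_irq(void)   { irqs_enabled = false; }
static void model_unlock_irq(void) { irqs_enabled = true; }

/* Models spin_lock_irqsave()/spin_unlock_irqrestore(): remember the
 * previous state in *flags and restore exactly that on unlock. */
static void model_lock_irqsave(bool *flags)
{
    *flags = irqs_enabled;
    irqs_enabled = false;
}

static void model_unlock_irqrestore(bool flags)
{
    irqs_enabled = flags;
}

/* Enter a critical section with interrupts already off and report
 * whether they are on afterwards when the _irq pair is used. */
bool irq_variant_leaks_enable(void)
{
    irqs_enabled = false;           /* caller already disabled interrupts */
    model_lock_irq();
    assert(!irqs_enabled);          /* inside the critical section */
    model_unlock_irq();
    return irqs_enabled;            /* true: interrupts were switched back on */
}

/* Same scenario using the save/restore pair. */
bool irqsave_variant_leaks_enable(void)
{
    bool flags;

    irqs_enabled = false;           /* caller already disabled interrupts */
    model_lock_irqsave(&flags);
    assert(!irqs_enabled);          /* inside the critical section */
    model_unlock_irqrestore(flags);
    return irqs_enabled;            /* false: the caller's state is preserved */
}
```

In this model only the first function leaves interrupts enabled behind the caller's back, which is the kind of forced toggling the save/restore variants avoid.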

Blindly chasing the crashing, I've found one more probably relevant
piece of information.  As I was playing around trying to see if I had
actually stopped the racing, I noticed my battery was low.  I plugged
it in and immediately received the RT throttling crash.  I've now tried
quite a few times, and on battery I don't see the crashing.  I thought
maybe dynamic CPU clocking is changing some of the racing properties.
When I bring everything up on battery, wait around a bit, and then
plug in the usb-c cable, within a few seconds it will often trigger
the RT throttling message.  I poked a little bit at some of the wifi
power management settings, specifically trying to disable them, but I
haven't hit anything relevant yet.  I can essentially use the
power cable as a trigger for this race though.

Kalle - are you aware of anything that happens to the driver/adapter
when AC power shows up?  I think I see some power saving stuff in
wmi.c but I haven't gotten deep enough to know...
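
In case it helps anyone reproduce the AC trigger with timestamps, here is a minimal sketch that reads a power supply "online" attribute from sysfs so the plug-in moment can be lined up against dmesg. The helper names are invented, and the sysfs path in the comment is an assumption, since the supply name differs between machines.

```c
#include <assert.h>
#include <stdio.h>

/* Parse the text of a power_supply "online" attribute.
 * Returns 1 if it starts with '1', 0 if it starts with '0',
 * and -1 for anything else (including an empty string). */
int parse_online(const char *text)
{
    assert(text != NULL);

    if (text[0] == '1')
        return 1;
    if (text[0] == '0')
        return 0;
    return -1;
}

/* Read and parse one attribute file, for example
 * /sys/class/power_supply/AC/online (assumed name, check your system).
 * Returns -1 if the file cannot be opened or read. */
int read_online(const char *path)
{
    char buf[8] = "";
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), f) == NULL)
        buf[0] = '\0';
    fclose(f);

    return parse_online(buf);
}
```

Polling read_online() about once a second and printing a timestamp whenever the value changes should be enough to see how long after the plug-in the RT throttling message shows up.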


