ath11k: QCA6390 on Dell XPS 13 and kernel crashes
wi nk
wink at technolu.st
Wed Dec 9 04:43:23 EST 2020
On Wed, Dec 9, 2020 at 2:52 AM wi nk <wink at technolu.st> wrote:
>
> On Mon, Dec 7, 2020 at 6:01 PM wi nk <wink at technolu.st> wrote:
> >
> > On Mon, Dec 7, 2020 at 3:45 PM Mitchell Nordine
> > <mail at mitchellnordine.com> wrote:
> > >
> > > Thanks for sending through this patch Wink.
> > >
> > > I built and installed the ath11k-qca6390-bringup branch with your patch last night on my Dell XPS 13 9310 running NixOS. I have only run the patch 6 times. The startup sequence seems more reliable. I was able to successfully enable the adapter and connect to my router each time, however each time my system would eventually freeze a few minutes after. I noticed that mouse input would stutter for a moment before completely freezing.
> > >
> > > I tested on battery twice to check your theory w.r.t. power management, but did not notice any difference in behaviour.
> > >
> > > > > sudo sh -c "echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control"
> > >
> > > I tried running this but haven't noticed any difference to the output I'm observing in `dmesg` or `journalctl`. There's a chance that there's another way I should be doing this on NixOS as most things including the kernel and its configuration are built and configured declaratively. I'll try and work this out next time I get the chance to have a longer testing session.
> > >
> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > On Monday, December 7, 2020 2:17 AM, wi nk <wink at technolu.st> wrote:
> > >
> > > > On Sun, Dec 6, 2020 at 10:45 PM wi nk wink at technolu.st wrote:
> > > >
> > > > > On Sun, Dec 6, 2020 at 6:53 PM wi nk wink at technolu.st wrote:
> > > > >
> > > > > > On Sun, Dec 6, 2020 at 6:39 PM Mitchell Nordine
> > > > > > mail at mitchellnordine.com wrote:
> > > > > >
> > > > > > > I recently tried updating to the latest set of patches on `ath11k-qca6390-bringup`, and as expected the crashing still remains (XPS 13 9310 with the QCA6390). I'm finding it difficult to test any of the other behaviour (like improved suspend, etc) as I'm seeing crashes the vast majority of the time. Normally this occurs when the wifi first attempts to connect to a network. On the rare occasion where it does connect successfully, it appears to run smoothly for a seemingly random amount of time before spontaneously crashing and freezing the system. I haven't managed to identify any particular action that causes this.
> > > > > > > FWIW, I still haven't managed to enable Bluetooth in my kernel yet, so there's very little chance that it's contributing to the issue in my case. I think wi-nk's observation is correct that the Bluetooth impacting raciness they observed was just a coincidence.
> > > > > > > Let me know if there is anything else I can test to help, or any particular kinds of debugging output you would like to see and I'll give it a go next time I get the chance to test.
> > > > > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > > > > On Sunday, December 6, 2020 6:00 PM, ath11k-request at lists.infradead.org wrote:
> > > > > > >
> > > > > > > > Today's Topics:
> > > > > > > >
> > > > > > > > 1. Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes (wi nk)
> > > > > > > > 2. Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes (wi nk)
> > > > > > > >
> > > > > > > > Message: 1
> > > > > > > > Date: Sat, 5 Dec 2020 20:17:10 +0100
> > > > > > > > From: wi nk wink at technolu.st
> > > > > > > > To: Kalle Valo kvalo at codeaurora.org
> > > > > > > > Cc: Thomas Krause thomaskrause at posteo.de, ath11k at lists.infradead.org
> > > > > > > > Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> > > > > > > > Message-ID:
> > > > > > > > CAHUdJJX6JWbNY+=B2D1fFGZPqzbJSw0V0C2i+bZ=xabE56cv_A at mail.gmail.com
> > > > > > > > Content-Type: text/plain; charset="UTF-8"
> > > > > > > > On Tue, Dec 1, 2020 at 11:17 AM wi nk wink at technolu.st wrote:
> > > > > > > >
> > > > > > > > > On Mon, Nov 30, 2020 at 6:02 PM wi nk wink at technolu.st wrote:
> > > > > > > > >
> > > > > > > > > > On Mon, Nov 30, 2020 at 5:55 PM Kalle Valo kvalo at codeaurora.org wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Wi and Thomas,
> > > > > > > > > > > I'll start a new thread about problems on XPS 13. The information is
> > > > > > > > > > > scattered to different threads and hard to find everything, it's much
> > > > > > > > > > > easier to have everything in one place. So let's continue the discussion
> > > > > > > > > > > about the kernel crashes on this thread.
> > > > > > > > > > > Here's what I have understood so far:
> > > > > > > > > > >
> > > > > > > > > > > - On Dell XPS 15 there are no issues with QCA6390 and it seems to work
> > > > > > > > > > > with 32 MSI vectors.
> > > > > > > > > > >
> > > > > > > > > > > - On Dell XPS 13 there's a BIOS bug and kernel prints:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > [ 0.050130] DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
> > > > > > > > > > > BIOS vendor: Dell Inc.; Ver: 1.1.1; Product Version:
> > > > > > > > > > >
> > > > > > > > > > > - Because of this BIOS bug QCA6390 only gets one MSI vector on Dell XPS
> > > > > > > > > > > > 13. We added a hack to ath11k to make it work with only one vector and after
> > > > > > > > > > > that it's possible to boot the firmware, connect to the AP and use the
> > > > > > > > > > > device for a while.
> > > > > > > > > > >
> > > > > > > > > > > - But the problem now is that the kernel is crashing almost immediately
> > > > > > > > > > > and almost every time(?). And these crashes only happen on Dell XPS
> > > > > > > > > > > 13, all other systems (including Dell XPS 15) seem to work without
> > > > > > > > > > > issues.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Is my understanding correct? Did I miss anything?
> > > > > > > > > > > About the symptoms Wi reports:
> > > > > > > > > > > So up until this point, everything is working without issues.
> > > > > > > > > > > Everything seems to spiral out of control a couple of seconds later
> > > > > > > > > > > when my system attempts to actually bring up the adapter. In most of
> > > > > > > > > > > the crash states I will see this:
> > > > > > > > > > > [ 31.286725] wlp85s0: send auth to ec:08:6b:27:01:ea (try 1/3)
> > > > > > > > > > > [ 31.390187] wlp85s0: send auth to ec:08:6b:27:01:ea (try 2/3)
> > > > > > > > > > > [ 31.391928] wlp85s0: authenticated
> > > > > > > > > > > [ 31.394196] wlp85s0: associate with ec:08:6b:27:01:ea (try 1/3)
> > > > > > > > > > > [ 31.396513] wlp85s0: RX AssocResp from ec:08:6b:27:01:ea
> > > > > > > > > > > (capab=0x411 status=0 aid=6)
> > > > > > > > > > > [ 31.407730] wlp85s0: associated
> > > > > > > > > > > [ 31.434354] IPv6: ADDRCONF(NETDEV_CHANGE): wlp85s0: link becomes ready
> > > > > > > > > > > And then either somewhere in that pile of messages, or a second or two
> > > > > > > > > > > after this my machine will start to stutter as I mentioned before, and
> > > > > > > > > > > then it either hangs, or I see this message (I'm truncating the
> > > > > > > > > > > timestamp):
> > > > > > > > > > > [ 35.xxxx ] sched: RT throttling activated
> > > > > > > > > > > After that moment, the machine is unresponsive. Sorry I can't seem to
> > > > > > > > > > > extract this data other than screenshots from my phone at the moment,
> > > > > > > > > > > you can see the dmesg output from 6 different hangs here:
> > > > > > > > > > > https://github.com/w1nk/ath11k-debug
> > > > > > > > > > >
> > > > > > > > > > > And Thomas Krause reports:
> > > > > > > > > > > > I can confirm this behavior on my configuration. I managed to log in
> > > > > > > > > > > > once and select the Wifi and connect to it. Curiously enough, it seemed
> > > > > > > > > > > > to be stable long enough to enter the Wifi passphrase. After the
> > > > > > > > > > > > connection was established, the system hung and on each attempt to
> > > > > > > > > > > reboot into the graphical system it would freeze at some point
> > > > > > > > > > > (sometimes even before showing the login screen).
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > https://patchwork.kernel.org/project/linux-wireless/list/
> > > > > > > > > > > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > > > > > > > > >
> > > > > > > > > > Hi Kalle,
> > > > > > > > > > Again, thanks much for your work. I think you've summarized
> > > > > > > > > > > everything up until this point. On my XPS 13 9310 the behavior of the
> > > > > > > > > > RT throttling still exists for me occasionally on loading the
> > > > > > > > > > driver/associating with an AP. The throttling consistently occurs
> > > > > > > > > > after a few sets of the MHI debug printing showing the EE entering an
> > > > > > > > > > invalid state ( AMSS -> INVALID_EE ). I'm now building the latest tag
> > > > > > > > > > to see if there are any differences.
> > > > > > > > > > Thanks!
> > > > > > > > >
> > > > > > > > > Just to follow up, the first boot resulted in the RT throttling
> > > > > > > > > message as the adapter was coming up/associating, shortly after the
> > > > > > > > > > firmware crashed and the kernel didn't fully freeze, but I needed to
> > > > > > > > > reboot to bring the adapter back.
> > > > > > > >
> > > > > > > > Kalle -
> > > > > > > > I've noticed one additional behavior that may give someone with
> > > > > > > > familiarity with the QCA hardware a clue. I'm running
> > > > > > > > ath11k-qca6390-bringup-202011301608 on the dell xps 13 9310. For
> > > > > > > > whatever reason, having the bluetooth subsystem enabled (with a paired
> > > > > > > > device) on this dell basically guarantees I'll hit the scheduler
> > > > > > > > throttling issue as the ath11k driver is initializing / associating.
> > > > > > > > The bluetooth system is using the btqca driver. I don't have any
> > > > > > > > useful debugging (I'll gladly collect some if there is a way to do it)
> > > > > > > > other than tracking some simple statistics. I booted my system 20
> > > > > > > > > times, 10 times with bluetooth enabled (and some headphones turned on
> > > > > > > > ready to pair), and 10 times without. In both scenarios, I'm booting
> > > > > > > > into X and manually modprobing the ath11k driver. The difference is
> > > > > > > > that with bluetooth on and by the time I modprobe the driver, the
> > > > > > > > headphones are paired and I received the throttling message and
> > > > > > > > subsequent freezing 10/10 times. With bluetooth off / my headphones
> > > > > > > > not paired, I only saw it 2/10. I know it's not much hard information
> > > > > > > > but it's reliably reproducible for me, is there anything useful I can
> > > > > > > > collect?
> > > > > > > >
> > > > > > > > Message: 2
> > > > > > > > Date: Sun, 6 Dec 2020 09:05:57 +0100
> > > > > > > > From: wi nk wink at technolu.st
> > > > > > > > To: Kalle Valo kvalo at codeaurora.org
> > > > > > > > Cc: Thomas Krause thomaskrause at posteo.de, ath11k at lists.infradead.org
> > > > > > > > Subject: Re: ath11k: QCA6390 on Dell XPS 13 and kernel crashes
> > > > > > > > Message-ID:
> > > > > > > > CAHUdJJU0ykf96GbaMrhkcPv2xSF62CDPNSNSgtoGP6BtBTAk6Q at mail.gmail.com
> > > > > > > > Content-Type: text/plain; charset="UTF-8"
> > > > > > > > On Sat, Dec 5, 2020 at 8:17 PM wi nk wink at technolu.st wrote:
> > > > > > > >
> > > > > > > > Well unfortunately I think the bluetooth was just a red herring in the
> > > > > > > > racing. To chase that, I disabled all bluetooth and was able to get
> > > > > > > > into a state where I had 6 failed boots in a row. To further poke
> > > > > > > > around, I rebuilt the kernel with localmodconfig to disable building
> > > > > > > > big chunks of things. This kernel is way less stable and seems to
> > > > > > > > freeze most of the time (but does occasionally remain stable), I'm not
> > > > > > > > sure what else got disabled in there, but it seems to have had a
> > > > > > > > negative impact on the crash racing.
> > > > > > > >
> > > > > >
> > > > > > Hey Mitchell,
> > > > > > One more thing to try that may help us get a little bit of extra
> > > > > > info. Out of everything I've done, something that has remained
> > > > > > consistent is to enable the MHI debugging as Kalle suggested:
> > > > > > sudo sh -c "echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control"
> > > > > > Before any crash/spinlock, I see the MHI printing (from
> > > > > > drivers/bus/mhi/core/main.c L389) show the EE enter an invalid state
> > > > > > and then after a number more iterations through this function, things
> > > > > > finally go out of control. So from
> > > > > >
> > > > > > > dev_dbg(dev, "local ee:%s device ee:%s dev_state:%s\n",
> > > > > > TO_MHI_EXEC_STR(mhi_cntrl->ee), TO_MHI_EXEC_STR(ee),
> > > > > > TO_MHI_STATE_STR(state));
> > > > > >
> > > > > >
> > > > > > I'll see something like this:
> > > > > > [ 312.xxx] mhi 0000:55:00.0: local ee:AMSS device ee:AMSS dev_state:M2
> > > > > > [ 313.024033] mhi 0000:55:00.0: local ee:INVALID_EE device
> > > > > > ee:INVALID_EE dev_state:SYS_ERR
> > > > > > Then after a few of those prints showing SYS_ERR, either a spinlock or
> > > > > > a firmware crash. I'm not sure what causes this ee state to go
> > > > > > invalid, but maybe that's worth looking into. Can you confirm the
> > > > > > same behavior? To see this a little easier, I also run dmesg -wH in
> > > > > > two windows, one piping to | grep -v mhi (to filter out the mhi
> > > > > > debugging).
> > > > > > Thanks!
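A minimal sketch of the filtering described above, for a dmesg log that has already been saved to a file. The helper names without_mhi and first_sys_err are hypothetical, and the match strings are only assumptions based on the sample lines quoted in this thread:

```shell
#!/bin/sh
# Hypothetical helpers; both read a saved dmesg log on stdin.

# Drop the MHI debug lines, like `dmesg -wH | grep -v mhi` in the text.
without_mhi() {
    grep -v 'mhi'
}

# Print the first line (prefixed with its line number) on which the MHI
# debug output reports dev_state:SYS_ERR.
first_sys_err() {
    grep -n 'dev_state:SYS_ERR' | head -n 1
}

# Example use (dmesg.log is a placeholder file name):
#   first_sys_err < dmesg.log
#   without_mhi < dmesg.log | tail -n 20
```

Knowing the line number of the first SYS_ERR makes it easier to see which non-MHI messages came just before it.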
> > > > >
> > > > > So I've managed to stabilise my system now, so either the race is
> > > > > gone, or I've done something to win it all the time. So one of the
> > > > > avenues of racing I was chasing at first was in the ath11k driver
> > > > > itself. There are a couple areas where the single/shared IRQ is being
> > > > > forcibly toggled in ways that the documentation says are not great
> > > > > (and the original patch was trying to avoid). Fixing those didn't
> > > > > seem to have much impact on the stability of things (I've included
> > > > > those changes in my patch though). After the last email I was
> > > > > thinking about the MHI side of things a bit more and found a number of
> > > > > call sites that my naive grepping had missed that do the same thing,
> > > > > but via acquiring a lock at the same time. I modified all the calls
> > > > > to *_lock_irq and *_unlock_irq to the lock/unlock - save/restore
> > > > > variants that accept the flags parameter to capture state. I've now
> > > > > booted and loaded the driver 10+ times without a single freeze or
> > > > > crash. I'm not sure all of those modifications are necessary (ie:
> > > > > which things are re-entrant in this single interrupt operating mode vs
> > > > > which ones can use the simpler lock/unlock mechanisms), so I could use
> > > > > some advice/guidance there.
> > > > > Mitchell - if you want to grab this patch and try it, let me know how
> > > > > it goes and I can clean it up for the mailing list:
> > > > > https://github.com/w1nk/ath11k-debug/blob/master/one-irq-manage.patch
> > > > > (apply to ath11k-qca6390-bringup-202011301608)
> > > >
> > > > Blindly chasing the crashing, I've found one more probably relevant
> > > > piece of information. As I was playing around trying to see if I had
> > > > actually stopped the racing, I noticed my battery was low. I plugged
> > > > it in and immediately received the RT throttling crash. I've now tried
> > > > quite a bit, and on the battery I don't see the crashing. I thought
> > > > maybe dynamic CPU clocking is changing some of the racing properties.
> > > > When I bring everything up on the battery and wait around a bit, once
> > > > I plug in the usb-c cable, within a few seconds it will often trigger
> > > > the RT throttling message. I poked a little bit at some of the wifi
> > > > power management settings, specifically trying to disable them, but I
> > > > didn't seem to kick anything relevant yet. I can essentially use the
> > > > power cable as a trigger for this race though..
> > > >
> > > > Kalle - are you aware of anything that happens to the driver/adapter
> > > > when ac power shows up? I think I see some power saving stuff in
> > > > wmi.c but I haven't gotten deep enough to know...
> > >
> >
> > Mitchell - one thing to note re the mhi debugging, the module needs to
> > be in place first. Here's how I've been doing it:
> >
> > modprobe ath11k_pci; echo -n 'module mhi +p' > /sys/kernel/debug/dynamic_debug/control; dmesg -wH
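The ordering requirement above could be wrapped in a small guard. This is only a sketch: enable_mhi_debug is a hypothetical helper, and it assumes mhi is built as a module and therefore shows up in /proc/modules (if MHI is built into the kernel, the check would need to be dropped). Both paths can be overridden, which also allows a dry run against ordinary files:

```shell
#!/bin/sh
# Hypothetical helper: enable MHI dynamic debug only once the mhi module
# is listed as loaded. Needs root when used against the real debugfs file.
enable_mhi_debug() {
    control="${1:-/sys/kernel/debug/dynamic_debug/control}"
    modules="${2:-/proc/modules}"

    # /proc/modules lists one loaded module per line, module name first.
    if ! grep -q '^mhi ' "$modules"; then
        echo "mhi not loaded yet; modprobe ath11k_pci first" >&2
        return 1
    fi

    # Same query string as in the text, written without a trailing newline.
    printf '%s' 'module mhi +p' > "$control"
}

# Example use, as root:
#   modprobe ath11k_pci && enable_mhi_debug && dmesg -wH
```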
> >
> > In the previously linked git repo, I've added my kernel build config,
> > that may be worth trying. Another change I've made that seems to help
> > is to completely disable power management for 80211 in the kernel.
> > Between that and setting ubuntu to leave the iwconfig things alone, it
> > seems to have resolved the power plugging stuff. I'm guessing the
> > real racing is related to just attempting to configure/reconfigure
> > settings in the adapter (which is why we're seeing crashing when it
> > tries to actually attempt to 'do things', like associate or modify
> > operational configs, before it goes nuts). The thing that's weird is
> > that I'm assuming the instability has been introduced due to the
> > shared IRQ since presumably this driver works for the previous pieces
> > of hardware the chipset was put into, but specifically in those
> > codepaths, there's nothing obviously related to the single IRQ. Which
> > leads my thoughts back to timing/synchronization issues...
>
> While I'm semi-randomly poking things I decided to capture some
> information in a structured way that could be useful to Kalle and
> team. I'm running the latest bringup branch without any
> modifications. I booted my machine 6 consecutive times to demonstrate
> the power triggering the freezing I was referring to. In each video,
> you'll see the dmesg output, and in the cases I can control, you'll
> also see it with MHI debugging.
>
> The first 2 boots, I'm intentionally booting / initializing the driver
> on battery power and then waiting 5+ minutes to plug in the charger.
> Note: the system always comes online and remains stable when I start
> in this configuration, it's only when I plug the charger in that it
> crashes.
>
> Boot 1: https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_004643171.mp4
> - The machine and driver have been online and stable for 5 minutes (as
> seen in htop/ping), within a few seconds of plugging in the usb
> charger, the mhi debugging shows a failure and the machine crashes.
>
> Boot 2 : https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_005346443.mp4
> - Same set up (although the machine had been up for 6 minutes at that
> point) and failure as boot 1. The machine hard locks instantly this
> time, as opposed to the stuttering you can see in boot 1.
>
> For the next boots, I'm booting / initializing the driver with the
> charger plugged in ahead of time:
>
> Boot 3 : https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_005642416.mp4
> - Within a few seconds of the driver initializing, the machine
> crashes.
>
> Boot 4 : https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_005800378.mp4
> - Same setup as boot 3, but this time the system survives a bit longer
> (15 seconds or so).
>
> Boot 5: https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_005938734.mp4
> - Same setup as 3/4, similar crash to boot 4. The driver survives ~15
> seconds and then the machine hangs.
>
> After this I went back to the setup for boot 1/2 where I brought
> everything online, waited a bit over 5 minutes and plugged in the
> charger.
>
> Boot 6: https://github.com/w1nk/ath11k-debug/blob/master/PXL_20201209_010537553.mp4
> - This boot was successful and has remained stable. I'm composing
> this email from it. If this follows previous behavior, it should stay
> online for at least 24h (I always fiddled beyond that).
>
> So in conclusion, I wanted to demonstrate that being on battery power
> clearly keeps my system stable, and that the stability goes away when I
> plug in my charger (both up front and after initialization). I don't
> have any great ideas about what could be going on. I'm not entirely sure
> it's directly power related, but when I toggle it, clearly something is
> linked (maybe back to the ACPI tables being borked?). I'll leave this
> boot running as long as I can to see if it randomly crashes after an
> hour...
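To line up the moment the charger is plugged in with the dmesg timestamps, something like the following could be polled in another window. This is a sketch under assumptions: ac_state is a hypothetical helper, and it assumes the charger appears under /sys/class/power_supply with an "online" attribute (the supply names differ between machines):

```shell
#!/bin/sh
# Hypothetical helper: print "<supply> online=<value>" for every power
# supply that exposes an "online" attribute. The directory argument
# defaults to the usual sysfs location.
ac_state() {
    base="${1:-/sys/class/power_supply}"
    for f in "$base"/*/online; do
        # Skip the unexpanded glob when no supply has an "online" file.
        [ -r "$f" ] || continue
        name=$(basename "$(dirname "$f")")
        printf '%s online=%s\n' "$name" "$(cat "$f")"
    done
}

# Example use: log the state once a second with a wall-clock timestamp.
#   while sleep 1; do date '+%T'; ac_state; done
```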
GitHub didn't appreciate hosting those mp4s too much, so I've reuploaded
them here as well:
https://drive.google.com/drive/folders/1wvxZI5XtwPSrm0-6-Ov50cUfqBXSXeNz?usp=sharing