ath11k: QCA6390 on Dell XPS 13 and kernel crashes

wi nk wink at technolu.st
Fri Dec 11 07:28:08 EST 2020


On Wed, Dec 9, 2020 at 10:46 PM wi nk <wink at technolu.st> wrote:
>
> On Wed, Dec 9, 2020 at 4:55 PM wi nk <wink at technolu.st> wrote:
> >
> > On Wed, Dec 9, 2020 at 4:50 PM Kalle Valo <kvalo at codeaurora.org> wrote:
> > >
> > > wi nk <wink at technolu.st> writes:
> > >
> > > > On Wed, Dec 9, 2020 at 4:35 PM Kalle Valo <kvalo at codeaurora.org> wrote:
> > > >>
> > > >> wi nk <wink at technolu.st> writes:
> > > >>
> > > >> > So I've managed to stabilise my system now, so either the race is
> > > >> > gone, or I've done something to win it all the time.  So one of the
> > > >> > avenues of racing I was chasing at first was in the ath11k driver
> > > >> > itself.  There are a couple areas where the single/shared IRQ is being
> > > >> > forcibly toggled in ways that the documentation says are not great
> > > >> > (and the original patch was trying to avoid).  Fixing those didn't
> > > >> > seem to have much impact on the stability of things (I've included
> > > >> > those changes in my patch though).  After the last email I was
> > > >> > thinking about the MHI side of things a bit more and found a number of
> > > >> > call sites that my naive grepping had missed that do the same thing,
> > > >> > but via acquiring a lock at the same time.  I modified all the calls
> > > >> > to *_lock_irq and *_unlock_irq to the lock/unlock - save/restore
> > > >> > variants that accept the flags parameter to capture state.  I've now
> > > >> > booted and loaded the driver 10+ times without a single freeze or
> > > >> > crash.  I'm not sure all of those modifications are necessary (ie:
> > > >> > which things are re-entrant in this single interrupt operating mode vs
> > > >> > which ones can use the simpler lock/unlock mechanisms), so I could use
> > > >> > some advice/guidance there.
> > > >> >
> > > >> > Mitchell - if you want to grab this patch and try it, let me know how
> > > >> > it goes and I can clean it up for the mailing list:
> > > >> > https://github.com/w1nk/ath11k-debug/blob/master/one-irq-manage.patch
> > > >> > (apply to ath11k-qca6390-bringup-202011301608)
> > > >>
> > > >> Wink, I want to ask more about your the very interesting
> > > >> one-irq-manage.patch you wrote. Have you seen the "sched: RT throttling
> > > >> activated" crash with that patch? If yes, how many times, for example 5
> > > >> out of 10 times or something like that?
> > > >>
> > > >> Or is it so with one-irq-manage.patch the kernel doesn't crash at all? I
> > > >> didn't quite understand the situation.
> > > >>
> > > >> --
> > > >> https://patchwork.kernel.org/project/linux-wireless/list/
> > > >>
> > > >> https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > > >
> > > > Kalle,
> > > >
> > > >    Sorry for moving the thread :).
> > >
> > > No problem, I'll just make extra questions to make sure that I'm
> > > understanding things correctly :)
> > >
> > > > So I've attempted 2 patches that seem to produce varying degrees of
> > > > success. The single IRQ patch took the crashing behaviour from hard
> > > > locking immediately, to that stuttering / RT throttling message
> > > > consistently. So instead of hard locking 9/10 times and stuttering
> > > > 1/10, it was inverted.
> > >
> > > Ok, got it now.
> > >
> > > > The second patch disabling the m2 transition (even without the single
> > > > IRQ patch) seems to have resolved the issues altogether, but at the
> > > > expense of disabling this m2 state, which I don't have much idea of
> > > > the consequences..
> > >
> > > Sorry, I have missed that. What second patch are you talking about?
> > >
> > > Also can you share your /proc/interrupts in full?
> > >
> > > --
> > > https://patchwork.kernel.org/project/linux-wireless/list/
> > >
> > > https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
> > >
> > > --
> > > ath11k mailing list
> > > ath11k at lists.infradead.org
> > > http://lists.infradead.org/mailman/listinfo/ath11k
> >
> > Here's interrupts in full , and the short patch after:
> >
> >             CPU0       CPU1       CPU2       CPU3       CPU4
> > CPU5       CPU6       CPU7
> >    0:          7          0          0          0          0
> > 0          0          0   IO-APIC    2-edge      timer
> >    1:          0          0          0          0          0
> > 0          0       2923   IO-APIC    1-edge      i8042
> >    8:          0          0          0          0          0
> > 0          0          0   IO-APIC    8-edge      rtc0
> >    9:          0       9290          0          0          0
> > 0          0          0   IO-APIC    9-fasteoi   acpi
> >   12:          0          0          0          0          0
> > 0         53          0   IO-APIC   12-edge      i8042
> >   14:          0      29816          0          0          0
> > 0          0          0   IO-APIC   14-fasteoi   INT34C5:00
> >   16:          0          0          0          0          0
> > 10376          0          0   IO-APIC   16-fasteoi   intel_ish_ipc,
> > i801_smbus, idma64.4
> >   27:          0          0          0          0          0
> > 0          0          0   IO-APIC   27-fasteoi   idma64.0,
> > i2c_designware.0
> >   31:          0          0          0          0          0
> > 0          0          0   IO-APIC   31-fasteoi   idma64.2,
> > i2c_designware.2
> >   32:          0          0          0          0          0
> > 0          0          0   IO-APIC   32-fasteoi   idma64.3,
> > i2c_designware.3
> >   40:       9681     777197      27906          0          0
> > 0          0          0   IO-APIC   40-fasteoi   idma64.1,
> > i2c_designware.1
> >  120:          0          0          0          0          0
> > 0          0          0   PCI-MSI 114688-edge      PCIe PME, pciehp
> >  121:          0          0          0          0          0
> > 0          0          0   PCI-MSI 118784-edge      PCIe PME, pciehp
> >  122:          0          0          0          0          0
> > 0          0          0   PCI-MSI 458752-edge      PCIe PME
> >  123:          0          0          0          0          0
> > 0          0          0   PCI-MSI 475136-edge      PCIe PME
> >  124:          0          0          1          0          0
> > 0          0          0   PCI-MSI 229376-edge      vmd
> >  125:          0          0          0         27          0
> > 0          0          0   PCI-MSI 229377-edge      vmd
> >  126:          0          0          0          0       4303
> > 0          0          0   PCI-MSI 229378-edge      vmd
> >  127:          0          0          0          0          0
> > 2992          0        434   PCI-MSI 229379-edge      vmd
> >  128:          0          0          0          0          0
> > 593       2504          0   PCI-MSI 229380-edge      vmd
> >  129:          0          0          0          0        699
> > 0       1061       1873   PCI-MSI 229381-edge      vmd
> >  130:       2382        394          0        603          0
> > 0          0          0   PCI-MSI 229382-edge      vmd
> >  131:          0       1670          0        406        646
> > 0          0          0   PCI-MSI 229383-edge      vmd
> >  132:        692          0       2903          0          0
> > 0          0          0   PCI-MSI 229384-edge      vmd
> >  133:          0        518        913       2198          0
> > 0          0          0   PCI-MSI 229385-edge      vmd
> >  134:          0          0          0          0          0
> > 0          0          0   PCI-MSI 229386-edge      vmd
> >  135:          0          0          0          0          0
> > 0          0          0   PCI-MSI 229387-edge      vmd
> >  136:          0          0          0          0          0
> > 0          0          0   PCI-MSI 229388-edge      vmd
> >  137:          0          0          0          0          0
> > 0          0          0   PCI-MSI 229389-edge      vmd
> >  138:          0          0          0          0          0
> > 0          0          0   PCI-MSI 229390-edge      vmd
> >  139:          0          0          0          0          0
> > 0          0          0   PCI-MSI 229391-edge      vmd
> >  140:          0          0          0          0          0
> > 0          0          0   PCI-MSI 229392-edge      vmd
> >  141:          0          0          0          0          0
> > 0          0          0   PCI-MSI 229393-edge      vmd
> >  142:          0          0          0          0          0
> > 0          0          0   PCI-MSI 229394-edge      vmd
> >  143:          0          0          0          0          0
> > 0          0          0   VMD-MSI  124  PCIe PME, aerdrv, pcie-dpc
> >  144:          0          0          0          0          0
> > 0          1          0   PCI-MSI 212992-edge      xhci_hcd
> >  145:          0          0          0          0          0
> > 0          0         72   PCI-MSI 327680-edge      xhci_hcd
> >  146:          6          0          0          0          0
> > 0          0          0   PCI-MSI 45088768-edge      rtsx_pci
> >  147:          0          0          0          0          0
> > 0          0          0   VMD-MSI  125  nvme0q0
> >  148:          0          0          0       1859          0
> > 0          0      38399   PCI-MSI 32768-edge      i915
> >  149:          0          0          0          0          0
> > 0          0          0   VMD-MSI  126  nvme0q1
> >  150:          0          0          0          0          0
> > 0          0          0   VMD-MSI  127  nvme0q2
> >  151:          0          0          0          0          0
> > 0          0          0   VMD-MSI  128  nvme0q3
> >  152:          0          0          0          0          0
> > 0          0          0   VMD-MSI  129  nvme0q4
> >  153:          0          0          0          0          0
> > 0          0          0   VMD-MSI  130  nvme0q5
> >  154:          0          0          0          0          0
> > 0          0          0   VMD-MSI  131  nvme0q6
> >  155:          0          0          0          0          0
> > 0          0          0   VMD-MSI  132  nvme0q7
> >  156:          0          0          0          0          0
> > 0          0          0   VMD-MSI  133  nvme0q8
> >  157:          0      29816          0          0          0
> > 0          0          0  INT34C5:00  327  DLL0945:00
> >  158:          0          0          0          0          0
> > 0         48          0   PCI-MSI 360448-edge      mei_me
> >  159:          0          0          0          0          0
> > 0          0       1134   PCI-MSI 514048-edge      AudioDSP
> >  162:          0          0          0     108102          0
> > 0          0          0   PCI-MSI 44564480-edge      ce0, ce1, ce2,
> > ce3, ce5, ce7, ce8, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ,
> > DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ, DP_EXT_IRQ,
> > DP_EXT_IRQ, bhi, mhi, mhi
> >  NMI:          0          0          0          0          0
> > 0          0          0   Non-maskable interrupts
> >  LOC:      64516      80387      54151      82574      64663
> > 113373      58033      81555   Local timer interrupts
> >  SPU:          0          0          0          0          0
> > 0          0          0   Spurious interrupts
> >  PMI:          0          0          0          0          0
> > 0          0          0   Performance monitoring interrupts
> >  IWI:          5          2          1        760          1
> > 1          0      16078   IRQ work interrupts
> >  RTR:          6          0          0          0          0
> > 0          0          0   APIC ICR read retries
> >  RES:       1834       7304       1432       1807       3015
> > 1552       1417       1498   Rescheduling interrupts
> >  CAL:      21739      26798      28934      22211      22590
> > 28622      22541      20023   Function call interrupts
> >  TLB:      51267      49182      59392      48384      46755
> > 56491      48103      46560   TLB shootdowns
> >  TRM:          2          2          2          2          2
> > 2          2          2   Thermal event interrupts
> >  THR:          0          0          0          0          0
> > 0          0          0   Threshold APIC interrupts
> >  DFR:          0          0          0          0          0
> > 0          0          0   Deferred Error APIC interrupts
> >  MCE:          0          0          0          0          0
> > 0          0          0   Machine check exceptions
> >  MCP:          3          4          4          4          4
> > 4          4          4   Machine check polls
> >  ERR:         16
> >  MIS:          0
> >  PIN:          0          0          0          0          0
> > 0          0          0   Posted-interrupt notification event
> >  NPI:          0          0          0          0          0
> > 0          0          0   Nested posted-interrupt event
> >  PIW:          0          0          0          0          0
> > 0          0          0   Posted-interrupt wakeup event
> >
> > and the modification that disables m2 state:
> >
> > diff --git a/drivers/bus/mhi/core/pm.c b/drivers/bus/mhi/core/pm.c
> > index 3de7b1639ec6..20f670c8b129 100644
> > --- a/drivers/bus/mhi/core/pm.c
> > +++ b/drivers/bus/mhi/core/pm.c
> > @@ -55,12 +55,12 @@ static struct mhi_pm_transitions const
> > dev_state_transitions[] = {
> >      },
> >      {
> >          MHI_PM_M0,
> > -        MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
> > +        MHI_PM_M0 | MHI_PM_M3_ENTER |
> >          MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> >          MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
> >      },
> >      {
> > -        MHI_PM_M2,
> > +        MHI_PM_M0,
> >          MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> >          MHI_PM_LD_ERR_FATAL_DETECT
> >      },
>
> Adding one more data point.  The driver will not crash on
> initialization this way, but also with the M2 state transition
> disabled the system survives suspend and wake and the adapter
> successfully reassociates consistently.  As expected with my patch,
> the MHI driver shows everything stays in the M1 state instead of
> attempting to transition to M2 ever.  It also doesn't return back to
> M0 if I disconnect the power / replug it.  I'm not sure what things
> are affected by me hacking this state machine, but avoiding that M2
> transition has removed any obvious issues from my system.

While waiting for someone else to confirm, I can report that I've
still not seen any instability since this patch.  The laptop has been
stable through reboots, power cycling, suspension, etc.  I'd be happy
to continue to try to understand why this is this case.  It sounds
like Stephen isn't seeing these issues on 5.10 rc6 with the single msi
patch+reverting that one commit.  I can try to give that a shot if
it'd produce something useful.

Kalle - a couple quick questions, in the driver comments the M2 state
is loosely documented as a low power mode.  Why would it transition to
that while on charger/plugging in, but stay in M0 while on battery
(you can see this behavior in the videos I linked previously)?
Naively I would've expected the opposite behavior.  Also, is there any
way to prevent that transition other than my brute force?  It seems on
battery the 'nominal' state for it is M0, I'm not sure what the effect
of it being left in this M1 state really is even though there's
nothing observable.  Lastly, any thoughts as to why it seems that
transition causes the EE state to become invalid?

Thanks!



More information about the ath11k mailing list