ath11k: crashes with 1 MSI vector, workaround disable MHI M2 state

wi nk wink at technolu.st
Sun Dec 20 10:39:06 EST 2020


On Sun, Dec 20, 2020 at 4:05 PM Manivannan Sadhasivam
<manivannan.sadhasivam at linaro.org> wrote:
>
> Hi,
>
> On Sat, Dec 19, 2020 at 10:34:23PM +0100, wi nk wrote:
> > On Thu, Dec 17, 2020 at 10:53 AM Manivannan Sadhasivam
> > <manivannan.sadhasivam at linaro.org> wrote:
> > >
> > > Hi Kalle,
> > >
> > > On Wed, Dec 16, 2020 at 10:47:18AM +0200, Kalle Valo wrote:
> > > > Hi MHI devs,
> > > >
> > >
> > > [...]
> > >
> > > > After extensive debugging from wink he found out that disabling M2 state
> > > > makes the all problems go away:
> > > >
> > > > --- a/drivers/bus/mhi/core/pm.c
> > > > +++ b/drivers/bus/mhi/core/pm.c
> > > > @@ -55,12 +55,12 @@ static struct mhi_pm_transitions const dev_state_transitions[] = {
> > > >         },
> > > >         {
> > > >                 MHI_PM_M0,
> > > > -               MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
> > > > +               MHI_PM_M0 | MHI_PM_M3_ENTER |
> > > >                 MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> > > >                 MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
> > > >         },
> > > >         {
> > > > -               MHI_PM_M2,
> > > > +               MHI_PM_M0,
> > > >                 MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> > > >                 MHI_PM_LD_ERR_FATAL_DETECT
> > > >         },
> > > >
> > > > And indeed now we have numerous people reporting that with this
> > > > workaround ath11k is stable on their Dell XPS 13 9310 laptops. What on
> > > > earth could cause these kernel crashes/interrupt storms? And why is it
> > > > visible only on Dell laptops? Why does disabling M2 state fix it?
> > > >
> > >
> > > This is related to the ASPM state of the PCIe bus. In the meantime, I'd
> > > suggest to turn off ASPM using "pcie_aspm=off" in the kernel command
> > > line so that the MHI bus stays in M0.
> > >
> > > For debugging this issue, can someone enable debug logs for MHI and share
> > > the dmesg output (with ASPM enabled ofc)?
> > >
> > > Thanks,
> > > Mani
> >
> > Hi Mani,
> >
> >   Thanks for the information and ideas.  I tried to disable ASPM with
> > the kernel parameter you mentioned, that didn't seem to work, so I
> > removed ASPM support from my kernel altogether.  I still see the
> > adapter in the M1 state, which with my patch would've gone to M2 had
> > it not been disabled.  Is ASPM the only thing that will trigger the M*
> > transitions?  Would it require a transition to M2 regardless of
> > settings (maybe that's why it tried)?
>
> That's what I suspected but looks like the QCA6390 enters M1 state (which will
> inturn cause host MHI to transition to M2) when it detects the link inactivity
> using a timer. But with my NUC, I can't get QCA6390 to enter M1 state
> regardless of the ASPM support in BIOS. In both conditions (with/without ASPM)
> device just stays in M0. My hard is guess is that the device depends on the WAKE
> sideband signal to go low for entering the M1 state even when it detects link
> inactivity in the PCIe bus. And this signal might be low on Dell laptops. But I
> need Hemant/Bhaumik to confirm this!
>
> Inspite of that, I got a plenty of below messages in dmesg log when MHI
> debugging is enabled:
>
> local ee:AMSS device ee:AMSS dev_state:M0
> local ee:AMSS device ee:AMSS dev_state:M0
> local ee:AMSS device ee:AMSS dev_state:M0
> local ee:AMSS device ee:AMSS dev_state:M0
> local ee:AMSS device ee:AMSS dev_state:M0
>
> And this only happens when one MSI vector is used and shared by all IRQs. This
> is because for shared IRQs, the kernel calls all of the registered ISRs of the
> interrupt line when an interrupt occurs. So I cooked up a patch which checks for
> the device state before proceeding through the ISR:
>
> diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
> index 2cff5ddff225..520948f3051d 100644
> --- a/drivers/bus/mhi/core/main.c
> +++ b/drivers/bus/mhi/core/main.c
> @@ -386,6 +386,13 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, void *priv)
>         state = mhi_get_mhi_state(mhi_cntrl);
>         ee = mhi_cntrl->ee;
>         mhi_cntrl->ee = mhi_get_exec_env(mhi_cntrl);
> +
> +       /* Only proceed if the device state is different */
> +       if (mhi_cntrl->dev_state == state) {
> +               write_unlock_irq(&mhi_cntrl->pm_lock);
> +               goto exit_intvec;
> +       }
> +
>         dev_dbg(dev, "local ee:%s device ee:%s dev_state:%s\n",
>                 TO_MHI_EXEC_STR(mhi_cntrl->ee), TO_MHI_EXEC_STR(ee),
>                 TO_MHI_STATE_STR(state));
>
> This prevents the spike of the debug messages but not sure if the issue you're
> seeing is related to this. Can you give this patch a try on your setup?
>
> Thanks,
> Mani
>
> > The MHI dmesg output is pretty
> > consistent when it fails, it looks like this:
> > https://i.imgur.com/0XExack.jpg .  You can also see it in the mp4's
> > I've placed here:
> > https://drive.google.com/drive/folders/1wvxZI5XtwPSrm0-6-Ov50cUfqBXSXeNz?usp=sharing
> > .  Also note that the failure isn't deterministic, sometimes the
> > transition to M2 will succeed and everything works.
> >
> > Thanks!

Mani,

  I'll give the patch a try once I have a moment.  For some extra
information, I also patched the interrupt handlers so they'd bail
early if the signal wasn't for them (since there are ~20 ISRs attached
between MHI and the ath11k driver).  That didn't seem to have a large
effect on anything other than potentially changing some of the timing
of things.  Another interesting note is that the laptop only seems to
attempt the transition when it's plugged into the charger (see my mp4s
for more info).  If I boot the laptop on battery and bring up the
adapter, it never attempts the transition beyond m0 either.

Thanks!



More information about the ath11k mailing list