ath11k: crashes with 1 MSI vector, workaround disable MHI M2 state
Manivannan Sadhasivam
manivannan.sadhasivam at linaro.org
Sun Dec 20 10:05:03 EST 2020
Hi,
On Sat, Dec 19, 2020 at 10:34:23PM +0100, wi nk wrote:
> On Thu, Dec 17, 2020 at 10:53 AM Manivannan Sadhasivam
> <manivannan.sadhasivam at linaro.org> wrote:
> >
> > Hi Kalle,
> >
> > On Wed, Dec 16, 2020 at 10:47:18AM +0200, Kalle Valo wrote:
> > > Hi MHI devs,
> > >
> >
> > [...]
> >
> > > After extensive debugging from wink he found out that disabling M2 state
> > > makes the all problems go away:
> > >
> > > --- a/drivers/bus/mhi/core/pm.c
> > > +++ b/drivers/bus/mhi/core/pm.c
> > > @@ -55,12 +55,12 @@ static struct mhi_pm_transitions const dev_state_transitions[] = {
> > > },
> > > {
> > > MHI_PM_M0,
> > > - MHI_PM_M0 | MHI_PM_M2 | MHI_PM_M3_ENTER |
> > > + MHI_PM_M0 | MHI_PM_M3_ENTER |
> > > MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> > > MHI_PM_LD_ERR_FATAL_DETECT | MHI_PM_FW_DL_ERR
> > > },
> > > {
> > > - MHI_PM_M2,
> > > + MHI_PM_M0,
> > > MHI_PM_M0 | MHI_PM_SYS_ERR_DETECT | MHI_PM_SHUTDOWN_PROCESS |
> > > MHI_PM_LD_ERR_FATAL_DETECT
> > > },
> > >
> > > And indeed now we have numerous people reporting that with this
> > > workaround ath11k is stable on their Dell XPS 13 9310 laptops. What on
> > > earth could cause these kernel crashes/interrupt storms? And why is it
> > > visible only on Dell laptops? Why does disabling M2 state fix it?
> > >
> >
> > This is related to the ASPM state of the PCIe bus. In the meantime, I'd
> > suggest to turn off ASPM using "pcie_aspm=off" in the kernel command
> > line so that the MHI bus stays in M0.
> >
> > For debugging this issue, can someone enable debug logs for MHI and share
> > the dmesg output (with ASPM enabled ofc)?
> >
> > Thanks,
> > Mani
>
> Hi Mani,
>
> Thanks for the information and ideas. I tried to disable ASPM with
> the kernel parameter you mentioned, that didn't seem to work, so I
> removed ASPM support from my kernel altogether. I still see the
> adapter in the M1 state, which with my patch would've gone to M2 had
> it not been disabled. Is ASPM the only thing that will trigger the M*
> transitions? Would it require a transition to M2 regardless of
> settings (maybe that's why it tried)?
That's what I suspected but looks like the QCA6390 enters M1 state (which will
inturn cause host MHI to transition to M2) when it detects the link inactivity
using a timer. But with my NUC, I can't get QCA6390 to enter M1 state
regardless of the ASPM support in BIOS. In both conditions (with/without ASPM)
device just stays in M0. My hard is guess is that the device depends on the WAKE
sideband signal to go low for entering the M1 state even when it detects link
inactivity in the PCIe bus. And this signal might be low on Dell laptops. But I
need Hemant/Bhaumik to confirm this!
Inspite of that, I got a plenty of below messages in dmesg log when MHI
debugging is enabled:
local ee:AMSS device ee:AMSS dev_state:M0
local ee:AMSS device ee:AMSS dev_state:M0
local ee:AMSS device ee:AMSS dev_state:M0
local ee:AMSS device ee:AMSS dev_state:M0
local ee:AMSS device ee:AMSS dev_state:M0
And this only happens when one MSI vector is used and shared by all IRQs. This
is because for shared IRQs, the kernel calls all of the registered ISRs of the
interrupt line when an interrupt occurs. So I cooked up a patch which checks for
the device state before proceeding through the ISR:
diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
index 2cff5ddff225..520948f3051d 100644
--- a/drivers/bus/mhi/core/main.c
+++ b/drivers/bus/mhi/core/main.c
@@ -386,6 +386,13 @@ irqreturn_t mhi_intvec_threaded_handler(int irq_number, void *priv)
state = mhi_get_mhi_state(mhi_cntrl);
ee = mhi_cntrl->ee;
mhi_cntrl->ee = mhi_get_exec_env(mhi_cntrl);
+
+ /* Only proceed if the device state is different */
+ if (mhi_cntrl->dev_state == state) {
+ write_unlock_irq(&mhi_cntrl->pm_lock);
+ goto exit_intvec;
+ }
+
dev_dbg(dev, "local ee:%s device ee:%s dev_state:%s\n",
TO_MHI_EXEC_STR(mhi_cntrl->ee), TO_MHI_EXEC_STR(ee),
TO_MHI_STATE_STR(state));
This prevents the spike of the debug messages but not sure if the issue you're
seeing is related to this. Can you give this patch a try on your setup?
Thanks,
Mani
> The MHI dmesg output is pretty
> consistent when it fails, it looks like this:
> https://i.imgur.com/0XExack.jpg . You can also see it in the mp4's
> I've placed here:
> https://drive.google.com/drive/folders/1wvxZI5XtwPSrm0-6-Ov50cUfqBXSXeNz?usp=sharing
> . Also note that the failure isn't deterministic, sometimes the
> transition to M2 will succeed and everything works.
>
> Thanks!
More information about the ath11k
mailing list