ath12k_pci errors and loss of connectivity in 6.12.y branch

Sun Jun 29 23:36:41 PDT 2025

Hi,

On 6/27/2025 3:54 PM, Robin Murphy wrote:
> +Vasant
> 
> On 2025-06-27 6:39 am, Baochen Qiang wrote:
>> [+ IOMMU list]
>>
>> On 6/27/2025 12:21 AM, Matt Mower wrote:
>>> Dear maintainer,
>>>
>>> I have been experiencing lost network connection with the ath12k_pci driver
>>> in the linux-6.12.y kernel branch. Often, when the issue occurs, the
>>> network does not recover until I reboot the computer. A full report of the
>>> errors I encounter, the symptoms that arise, and several dmesg attachments
>>> are in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1107521 . I have
>>> attached a dmesg from 6.12.34 for convenience. The short summary is:
>>>
>>> 1. I started noticing log lines like the following soon after boot when I
>>> updated from 6.12.22 to 6.12.27. After these events occur, the network goes
>>> down and often does not come back up.
>>>     ath12k_pci 0000:c2:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0xfea00000 flags=0x0020]
>>> 2. I was able to reproduce this issue very rarely in 6.12.12 and 6.12.22.
>>> The issue always occurs soon after boot in 6.12.27, 6.12.30, 6.12.33, and
>>> 6.12.34.
>>> 3. I have not reproduced the issue in 6.15.2 or 6.15.3.
>>> 4. In some cases, when shutting down the computer, a kernel bug caused my
>>> computer to hang. I haven't determined whether this is related to the issue
>>> above or an independent issue. Search the bug report
>>> for PXL_20250611_140820085.jpg to see a picture of the kernel bug on my
>>> laptop screen.
>>> 5. I have tested two firmware versions:
>>>     a. fw_version 0x1108811c fw_build_timestamp 2025-05-17 00:21 fw_build_id
>>> QC_IMAGE_VERSION_STRING=WLAN.HMT.1.1.c5-00284.1-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
>>>     b. fw_version 0x100301e1 fw_build_timestamp 2023-12-06 04:05 fw_build_id
>>> QC_IMAGE_VERSION_STRING=WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
>>>
>>> Thanks,
>>> Matt
>>>
>>
>> I had a quick test with 6.12.27 kernel on both my Intel desktop and AMD RD but
>> didn't hit the issue. And I am using
>> WLAN.HMT.1.1.c5-00284.1-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3.
>>
>> As mentioned in the Debian bug report, since reverting ath12k patches does not
>> fix this issue, maybe it comes from the IOMMU subsystem?
> 
> Faults are usually still indicative of the client driver/subsystem doing
> something not quite right - racily performing dma_unmap before the device has
> actually finished making accesses; mapping the wrong size such that the device
> accesses off the end of the mapping (this can often run into another valid
> mapping so not necessarily fault); mapping the wrong DMA direction such that the
> device then tries to write to a read-only page. However I suppose it's not
> impossible that some fix to amd-iommu in that period might have changed its
> behaviour in a way that exacerbates things - Vasant, does this strike a chord
> with anything you're aware of?

I looked into the kernel code and the changes between v6.12.9..v6.12.22. There
are only two changes in the AMD IOMMU driver.

40c731472f41 iommu/amd: Expicitly enable CNTRL.EPHEn bit in resume path
   -> This one was needed to fix the suspend/resume issue. It just adjusts a
control bit after suspend. It's not touching the page table.
6e1e451456e1 iommu/amd: Remove unused amd_iommu_domain_update()
   -> Code cleanup patch.

Looking at the lspci output, only `c2:00.0` is placed in group 15 and domain ID
0x10. I believe there is only one device in this domain.

Interpreting the IO_PAGE_FAULT flags: 0x20 means it was a write request for a
page that was not present. So at this point I would still suspect the device
driver side rather than the IOMMU side.
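
For anyone who wants to decode other flag values from their own dmesg, here is
a minimal sketch of the interpretation above. The bit names and positions are
an assumption based on my reading of the IO_PAGE_FAULT event layout in the AMD
IOMMU specification (the driver prints bits 27:16 of the second event dword as
"flags"), so please verify them against the spec before relying on the output.
The helper names are hypothetical.

```python
# Assumed bit positions inside the 12-bit "flags" value printed by AMD-Vi.
# Verify against the AMD IOMMU specification before trusting the result.
IO_PAGE_FAULT_FLAGS = {
    0: "GN",  # guest/nested translation
    1: "NX",  # no-execute
    2: "US",  # user/supervisor
    3: "I",   # interrupt request
    4: "PR",  # page was present
    5: "RW",  # 1 = write request, 0 = read request
    6: "PE",  # permission violation
    7: "RZ",  # reserved bit set in the page table entry
    8: "TR",  # translation request
}


def decode_flags(flags):
    """Return the names of all assumed flag bits set in `flags`, lowest first."""
    return [name for bit, name in sorted(IO_PAGE_FAULT_FLAGS.items())
            if flags & (1 << bit)]


def describe(flags):
    """Summarise only the RW and PR bits as a short human-readable string."""
    names = decode_flags(flags)
    access = "write" if "RW" in names else "read"
    present = "present" if "PR" in names else "not present"
    return f"{access} request, page {present}"


if __name__ == "__main__":
    print(decode_flags(0x0020), "->", describe(0x0020))
```

With flags=0x0020 only the RW bit is set and PR is clear, which matches the
reading given above.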

> 
> A couple more things I'd try on the ath12k side: firstly, boot with
> "iommu.strict=1" and see if that makes the faults any more frequent/
> reproducible; if a fault is fairly easily reproducible, then use the DMA API
> and/or IOMMU API tracepoints to compare the fault address to prior DMA mapping
> activity - that can usually reveal the nature of the bug enough to then know
> what to go looking for.
> 
> I wouldn't put much significance in whatever happens *after* the fault -
> presumably the driver is assuming the blocked DMA write has completed, so then
> goes on to read some incomplete descriptor as if it were valid, and thus may
> fall over in all manner of entertaining ways on bogus data.

Thanks, Robin. I'd suggest following these suggestions.
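
As a rough illustration of comparing the fault address to prior mapping
activity, the sketch below scans a saved trace for IOMMU map/unmap events whose
IOVA range covers the faulting address. It assumes the trace lines contain
"map: IOMMU: iova=0x... - 0x..." and "unmap: IOMMU: iova=0x... - 0x...", which
is how I recall the iommu trace events being printed; check the format on your
kernel. The function names and the sample lines are hypothetical, and the trace
is assumed to end at the time of the fault.

```python
import re

# Assumed trace line format for the iommu map/unmap events (verify locally).
EVENT_RE = re.compile(
    r"\b(map|unmap): IOMMU: iova=(0x[0-9a-fA-F]+) - (0x[0-9a-fA-F]+)"
)


def covering_events(lines, fault_addr):
    """Return (line_no, kind, start, end) for events whose range covers fault_addr."""
    history = []
    for line_no, line in enumerate(lines, start=1):
        match = EVENT_RE.search(line)
        if not match:
            continue
        kind = match.group(1)
        start = int(match.group(2), 16)
        end = int(match.group(3), 16)  # treated as exclusive: iova + size
        if start <= fault_addr < end:
            history.append((line_no, kind, start, end))
    return history


def classify(history):
    """Give a first guess at the bug class from the last covering event."""
    if not history:
        return "never mapped in this trace"
    if history[-1][1] == "unmap":
        return "unmapped before the fault"
    return "still mapped"


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the reporter's system.
    sample = [
        "x-1 [003] 10.000001: map: IOMMU: iova=0x00000000fea00000 - "
        "0x00000000fea01000 paddr=0x0000000123456000 size=4096",
        "x-1 [003] 10.000100: unmap: IOMMU: iova=0x00000000fea00000 - "
        "0x00000000fea01000 size=4096 unmapped_size=4096",
    ]
    print(classify(covering_events(sample, 0xFEA00000)))
```

"unmapped before the fault" would point towards an unmap racing with the
device, while "never mapped in this trace" would point towards a wrong size or
a stale address. Since there is only one device in this domain, the events
should be easy to attribute, but treat the result as a hint only.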

-Vasant