ath12k_pci errors and loss of connectivity in 6.12.y branch
Baochen Qiang
baochen.qiang at oss.qualcomm.com
Wed Jul 2 00:10:34 PDT 2025
On 7/2/2025 1:17 PM, Matt Mower wrote:
>> A couple more things I'd try on the ath12k side: firstly, boot with
>> "iommu.strict=1" and see if that makes the faults any more
>> frequent/reproducible;
>
> The issue is easy enough to reproduce in 6.12.27 onward, and I may be
> mistaken about its rarity in 6.12.22; I reproduced it relatively
> quickly in .22 today. So, if making the faults more reproducible was
> the primary purpose of setting iommu.strict=1, it reproduces either
> with or without strict. FWIW, I did test iommu.strict=1 with 6.15.3
> and still have not reproduced the issue there.
>
>> if a fault is fairly easily reproducible, then
>> use the DMA API and/or IOMMU API tracepoints to compare the fault
>> address to prior DMA mapping activity - that can usually reveal the
>> nature of the bug enough to then know what to go looking for.
>
> This is unfamiliar territory for me, so I hope the following is at
> least close to what you requested. If not, happy to provide more test
> results based on a set of instructions. Here's what I did:
>
> 1. Set CONFIG_DMA_API_DEBUG=y
> 2. Set kernel command line to: iommu.strict=1 log_buf_len=100M
> dma_debug_driver=ath12k_pci trace_event=dma:*,iommu:*
> 3. Booted and waited for page fault, then cat'd
> /sys/kernel/tracing/trace to a file.
>
> Additionally, though I'm pretty sure this is irrelevant now, I added
> logging after each dma_map_single() in the ath12k driver to print the
> function name and resultant address to the kernel log.
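>
> For illustration, each added print was shaped roughly like this
> (hypothetical sketch; the exact changes are in the diff linked below):
>
>     paddr = dma_map_single(ab->dev, skb->data, len, DMA_TO_DEVICE);
>     /* hypothetical debug print: function name plus resulting address */
>     dev_info(ab->dev, "%s: dma_map_single -> %pad\n", __func__, &paddr);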
>
> Comparing the addresses of several io_page_fault lines in the trace
> and in the kernel log, they line up. So, I'm hopeful this is on the
> right track.
>
> DMA/IOMMU trace: https://cmphys.com/ath12k/iommu_dma_trace-20250701.log
> Kernel log with additional logging:
> https://cmphys.com/ath12k/dmesg-6.12.35-20250701.log
> Diff showing extra logging added to v6.12.35:
> https://cmphys.com/ath12k/ath12k-extra-logging-6.12.35-20250701.diff
Thanks, the log is helpful.
So the whole sequence is:
#1 ath12k allocates/maps it at a very early stage:

(udev-worker)-532 [010] ..... 4.878076: map: IOMMU: iova=0x00000000fe980000 - 0x00000000fea00000 paddr=0x000000010ec80000 size=524288
(udev-worker)-532 [010] ..... 4.878079: dma_alloc: 0000:c2:00.0 dma_addr=fe980000 size=524288 virt_addr=000000006cadbcb1 flags=GFP_KERNEL attrs=

#2 here it is unmapped/freed:

kworker/u64:0-12 [011] ..... 327.747763: dma_free: 0000:c2:00.0 dma_addr=fe980000 size=524288 virt_addr=000000006cadbcb1 attrs=
kworker/u64:0-12 [011] ..... 327.747766: unmap: IOMMU: iova=0x00000000fe980000 - 0x00000000fea00000 size=524288 unmapped_size=524288

#3 then the page fault:

irq/26-AMD-Vi-154 [006] ..... 327.753942: io_page_fault: IOMMU:ath12k_pci 0000:c2:00.0 iova=0x00000000fe980000 flags=0x0001
#4 here ath12k seems to be recovering:

[ 327.849022] mhi mhi0: Requested to power ON
This gives me the impression that the IOMMU page fault is caused by misbehaving firmware
that crashes. The sequence is: first the firmware crashes; then the host receives that
event and begins to recover, during which some DMA buffers are freed/unmapped. The
firmware, however, does not know that and continues to access them, hence the page fault.
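To make the suspected ordering concrete, here is a minimal sketch (hypothetical code,
not the actual ath12k recovery path):

	/* recovery path frees/unmaps the coherent buffer first ... */
	dma_free_coherent(ab->dev, size, vaddr, dma_addr);	/* unmap at 327.747766 */
	/* ... but the crashed firmware has not been quiesced yet and still
	 * writes to dma_addr, so the IOMMU blocks the access and reports
	 * io_page_fault (327.753942). A safe teardown would stop device DMA
	 * before freeing the buffer. */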
Matt, could you help enable verbose ath12k logging to verify my guess?

modprobe ath12k debug_mask=0xffffffff

Note this will make ath12k emit lots of logs. The purpose here is to check whether the
firmware crash happens before the page fault. You may monitor

ath12k_dbg(ab, ATH12K_DBG_BOOT, "reset starting\n");

in ath12k_core_reset(), which is the entry point of the ath12k recovery process.
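For orientation, that print sits at the top of the reset worker, roughly like this
(paraphrased sketch of drivers/net/wireless/ath/ath12k/core.c; details may differ by
kernel version):

	static void ath12k_core_reset(struct work_struct *work)
	{
		struct ath12k_base *ab = container_of(work, struct ath12k_base,
						      reset_work);

		/* entry of the recovery path: if this message appears in the
		 * log before io_page_fault, the firmware crash came first */
		ath12k_dbg(ab, ATH12K_DBG_BOOT, "reset starting\n");

		/* ... rest of the reset sequence elided ... */
	}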
And one more thing: the buffer in question is handled by the dma_alloc/free_xxx API
family, so adding logs to the dma_map/unmap_xxx APIs does not help here.
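In other words (a generic sketch of the two API families, not ath12k-specific code):

	/* streaming DMA: this is what dma_map/unmap logging catches */
	dma_addr_t da = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
	dma_unmap_single(dev, da, len, DMA_FROM_DEVICE);

	/* coherent DMA: the faulting buffer was allocated this way, so it
	 * only shows up in the dma_alloc/dma_free tracepoints */
	void *vaddr = dma_alloc_coherent(dev, size, &da, GFP_KERNEL);
	dma_free_coherent(dev, size, vaddr, da);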
>
> Thanks,
> Matt