[PATCH rc v2 0/5] iommu/arm-smmu-v3: Fix device crash on kdump kernel

Tian, Kevin kevin.tian at intel.com
Fri Apr 17 00:48:46 PDT 2026


> From: Jason Gunthorpe <jgg at nvidia.com>
> Sent: Friday, April 17, 2026 1:20 AM
> 
> On Thu, Apr 16, 2026 at 05:49:24PM +0100, Robin Murphy wrote:
> > On 15/04/2026 10:17 pm, Nicolin Chen wrote:
> > > When transitioning to a kdump kernel, the primary kernel might have crashed
> > > while endpoint devices were actively bus-mastering DMA. Currently, the SMMU
> > > driver aggressively resets the hardware during probe by clearing CR0_SMMUEN
> > > and setting the Global Bypass Attribute (GBPA) to ABORT.
> > >
> > > In a kdump scenario, this aggressive reset is highly destructive:
> > > a) If GBPA is set to ABORT, in-flight DMA will be aborted, generating fatal
> > >     PCIe AER or SErrors that may panic the kdump kernel
> > > b) If GBPA is set to BYPASS, in-flight DMA targeting some IOVAs will bypass
> > >     the SMMU and corrupt the physical memory at those 1:1 mapped IOVAs.
> >
> > But wasn't that rather the point? The kdump kernel doesn't know the scope of
> > how much could have gone wrong (including potentially the SMMU configuration
> > itself), so it just blocks everything, resets and reenables the devices it
> > cares about, and ignores whatever else might be on fire.
> 
> The purpose of kdump is to have the maximum chance to capture a dump
> from the blown up kernel.
> 
> Yes, on a perfect platform aborting the entire SMMU should improve the
> chance of getting that dump.
> 
> But sadly there are so many busted up platforms where if you start
> messing with the IOMMU they will explode and blow up the kdump. x86
> and "firmware first" error handling systems are particularly notorious
> for nasty behavior like this.
> 
> Seems like there are now ARM systems too. :(

Is there any report on such systems? It would be informative to include
a link to the report, so it's clear that this series fixes real issues rather
than preparing for future systems...

> 
> So, the iommu drivers have been preserving the IOMMU and not
> disrupting the DMAs on x86 for a long time. This is established kdump
> practice.
> 
> > If AER can panic a kdump kernel, that seems like a failing of the kdump
> > kernel itself more than anything else (especially given the likelihood that
> > additional AER events could follow from whatever initial crash/failure
> > triggered kdump to begin with).
> 
> Probably the kdump wasn't triggered by AER. You want kdump to not
> trigger more RAS events that might blow up the kdump while it is
> trying to run. That increases the chance of success.
> 

Btw, DMA is already allowed from the point where the previous kernel hangs
until the point where the smmu driver blocks it. In cases where in-flight
DMAs are considered dangerous to kdump, this series just makes an existing
issue worse rather than creating a new one. For the majority of other
failures, which are not related to DMA, not blocking them increases the
chance of success...


