[PATCH v2 4/4] iommu: Get DT/ACPI parsing into the proper probe path

Louis-Alexis Eyraud louisalexis.eyraud at collabora.com
Thu Mar 27 04:00:14 PDT 2025


Hi,

On Thu, 2025-03-27 at 17:47 +0800, Chen-Yu Tsai wrote:
> Hi,
> 
> On Fri, Feb 28, 2025 at 03:46:33PM +0000, Robin Murphy wrote:
> > In hindsight, there were some crucial subtleties overlooked when
> > moving
> > {of,acpi}_dma_configure() to driver probe time to allow waiting for
> > IOMMU drivers with -EPROBE_DEFER, and these have become an
> > ever-increasing source of problems. The IOMMU API has some
> > fundamental
> > assumptions that iommu_probe_device() is called for every device
> > added
> > to the system, in the order in which they are added. Calling it in
> > a
> > random order or not at all dependent on driver binding leads to
> > malformed groups, a potential lack of isolation for devices with no
> > driver, and all manner of unexpected concurrency and race
> > conditions.
> > We've attempted to mitigate the latter with point-fix bodges like
> > iommu_probe_device_lock, but it's a losing battle and the time has
> > come
> > to bite the bullet and address the true source of the problem
> > instead.
> > 
> > The crux of the matter is that the firmware parsing actually serves
> > two
> > distinct purposes; one is identifying the IOMMU instance associated
> > with
> > a device so we can check its availability, the second is actually
> > telling that instance about the relevant firmware-provided data for
> > the
> > device. However the latter also depends on the former, and at the
> > time
> > there was no good place to defer and retry that separately from the
> > availability check we also wanted for client driver probe.
> > 
> > Nowadays, though, we have a proper notion of multiple IOMMU
> > instances in
> > the core API itself, and each one gets a chance to probe its own
> > devices
> > upon registration, so we can finally make that work as intended for
> > DT/IORT/VIOT platforms too. All we need is for iommu_probe_device()
> > to
> > be able to run the iommu_fwspec machinery currently buried deep in
> > the
> > wrong end of {of,acpi}_dma_configure(). Luckily it turns out to be
> > surprisingly straightforward to bootstrap this transformation by
> > pretty
> > much just calling the same path twice. At client driver probe time,
> > dev->driver is obviously set; conversely at device_add(), or a
> > subsequent bus_iommu_probe(), any device waiting for an IOMMU
> > really
> > should *not* have a driver already, so we can use that as a
> > condition to
> > disambiguate the two cases, and avoid recursing back into the IOMMU
> > core
> > at the wrong times.
> > 
> > Obviously this isn't the nicest thing, but for now it gives us a
> > functional baseline to then unpick the layers in between without
> > many
> > more awkward cross-subsystem patches. There are some minor side-
> > effects
> > like dma_range_map potentially being created earlier, and some
> > debug
> > prints being repeated, but these aren't significantly detrimental.
> > Let's
> > make things work first, then deal with making them nice.
> > 
> > With the basic flow finally in the right order again, the next step
> > is
> > probably turning the bus->dma_configure paths inside-out, since all
> > we
> > really need from bus code is its notion of which device and input
> > ID(s)
> > to parse the common firmware properties with...
> > 
> > Acked-by: Bjorn Helgaas <bhelgaas at google.com> # pci-driver.c
> > Acked-by: Rob Herring (Arm) <robh at kernel.org> # of/device.c
> > Signed-off-by: Robin Murphy <robin.murphy at arm.com>
> 
> This change causes the MediaTek IOMMU driver to panic on probe,
> resulting in multiple MediaTek platforms not being able to boot.
> This was observed on Linus's tree at commit 1a9239bb4253
> ("Merge tag 'net-next-6.15' of
> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next")
> which just received the IOMMU updates a couple merge commits
> prior.
> 
> The regression was bisected down to this patch on MT8183 Juniper
> Chromebook. The error is a NULL pointer dereference. Here's the
> decoded backtrace:
> 
>     Unable to handle kernel read from unreadable memory at virtual
> address 0000000000000000
>     Mem abort info:
>       ESR = 0x0000000096000005
>     usb 1-1.1.2: new high-speed USB device number 6 using xhci-mtk
>       EC = 0x25: DABT (current EL), IL = 32 bits
>       SET = 0, FnV = 0
>       EA = 0, S1PTW = 0
>       FSC = 0x05: level 1 translation fault
>     Data abort info:
>       ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
>       CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>       GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
>     [0000000000000000] user address but active_mm is swapper
>     Internal error: Oops: 0000000096000005 [#1]  SMP
>     Modules linked in:
>     CPU: 4 UID: 0 PID: 68 Comm: kworker/u34:1 Tainted: G   
> B              6.14.0-05877-g1a9239bb4253 #621 PREEMPT 
> a6631d3f04612a5c23866dc67cf38316c6b023e0
>     Tainted: [B]=BAD_PAGE
>     Hardware name: Google juniper sku16 board (DT)
>     Workqueue: events_unbound deferred_probe_work_func
>     pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>     pc : mtk_iommu_device_group (drivers/iommu/mtk_iommu.c:368
> drivers/iommu/mtk_iommu.c:940)
>     lr : mtk_iommu_device_group (drivers/iommu/mtk_iommu.c:368
> drivers/iommu/mtk_iommu.c:940)
>     sp : ffffffc0809674a0
>     x29: ffffffc0809674a0 x28: ffffff80c1f0f078 x27: ffffff80c1f0f460
>     x26: ffffff80c1f0f458 x25: ffffffdb1b738788 x24: ffffffc0809676d0
>     x23: ffffff80c8fe4130 x22: ffffffffffffffed x21: 0000000000000000
>     x20: ffffff80c1f0f010 x19: ffffff80c8fd7d00 x18: 0000000000000000
>     x17: 000000040044ffff x16: 00500072b5593510 x15: 0000000000000000
>     x14: ffffff80c09fbb80 x13: ffffffa5bf03f000 x12: ffffffbb6399cdf1
>     x11: 1ffffffb6399cdf0 x10: ffffffbb6399cdf0 x9 : dfffffc000000000
>     x8 : 000000449c663210 x7 : ffffffdb1cce6f87 x6 : 0000000000000001
>     x5 : ffffffdb1cce6f80 x4 : ffffffbb6399cdf1 x3 : 0000000000000000
>     x2 : 0000000000000020 x1 : ffffff80c22ed940 x0 : 0000000000000001
>     Call trace:
>     mtk_iommu_device_group (drivers/iommu/mtk_iommu.c:368
> drivers/iommu/mtk_iommu.c:940) (P)
>     __iommu_probe_device (drivers/iommu/iommu.c:461
> drivers/iommu/iommu.c:563)
>     probe_iommu_group (drivers/iommu/iommu.c:1722)
>     bus_for_each_dev (drivers/base/bus.c:370)
>     iommu_device_register (drivers/iommu/iommu.c:1875
> drivers/iommu/iommu.c:276)
>     mtk_iommu_probe (drivers/iommu/mtk_iommu.c:1380)
>     platform_probe (drivers/base/platform.c:1404)
>     really_probe (drivers/base/dd.c:579 drivers/base/dd.c:658)
>     __driver_probe_device (drivers/base/dd.c:800)
>     driver_probe_device (drivers/base/dd.c:830)
>     __device_attach_driver (drivers/base/dd.c:959)
>     bus_for_each_drv (drivers/base/bus.c:462)
>     __device_attach (drivers/base/dd.c:1032)
>     device_initial_probe (drivers/base/dd.c:1080)
>     bus_probe_device (drivers/base/bus.c:537)
>     deferred_probe_work_func (drivers/base/dd.c:124)
>     process_one_work (./arch/arm64/include/asm/jump_label.h:36
> ./include/trace/events/workqueue.h:110 kernel/workqueue.c:3243)
>     worker_thread (kernel/workqueue.c:3313 (discriminator 2)
> kernel/workqueue.c:3400 (discriminator 2))
>     kthread (kernel/kthread.c:464)
>     ret_from_fork (arch/arm64/kernel/entry.S:863)
>     Code: 92800256 f940d2b5 aa1503e0 97e8dbd7 (f94002b5)
>     All code
>     ========
>        0:	92800256 	mov	x22,
> #0xffffffffffffffed    	// #-19
>        4:	f940d2b5 	ldr	x21, [x21, #416]
>        8:	aa1503e0 	mov	x0, x21
>        c:	97e8dbd7 	bl	0xffffffffffa36f68
>       10:*	f94002b5 	ldr	x21, [x21]		<--
> trapping instruction
> 
>     Code starting with the faulting instruction
>     ===========================================
>        0:	f94002b5 	ldr	x21, [x21]
>     ---[ end trace 0000000000000000 ]---
> 
> 
> ChenYu
> 
I've sent a patch [1] to fix this issue that I've also observed on
my Mediatek Genio boards (510-EVK and 1200-EVK).

[1]:https://lore.kernel.org/linux-iommu/20250327-fix-mtk-iommu-error-v1-1-df969158e752@collabora.com/

Regards,
Louis-Alexis

> > ---
> > 
> > v2:
> >  - Comment bus driver changes for clarity
> >  - Use dev->iommu as the now-robust replay condition
> >  - Drop the device_iommu_mapped() checks in the firmware paths as
> > they
> >    weren't doing much - we can't replace probe_device_lock just
> > yet...
> >  
> >  drivers/acpi/arm64/dma.c        |  5 +++++
> >  drivers/acpi/scan.c             |  7 -------
> >  drivers/amba/bus.c              |  3 ++-
> >  drivers/base/platform.c         |  3 ++-
> >  drivers/bus/fsl-mc/fsl-mc-bus.c |  3 ++-
> >  drivers/cdx/cdx.c               |  3 ++-
> >  drivers/iommu/iommu.c           | 24 +++++++++++++++++++++---
> >  drivers/iommu/of_iommu.c        |  7 ++++++-
> >  drivers/of/device.c             |  7 ++++++-
> >  drivers/pci/pci-driver.c        |  3 ++-
> >  10 files changed, 48 insertions(+), 17 deletions(-)
> > 
> > diff --git a/drivers/acpi/arm64/dma.c b/drivers/acpi/arm64/dma.c
> > index 52b2abf88689..f30f138352b7 100644
> > --- a/drivers/acpi/arm64/dma.c
> > +++ b/drivers/acpi/arm64/dma.c
> > @@ -26,6 +26,11 @@ void acpi_arch_dma_setup(struct device *dev)
> >  	else
> >  		end = (1ULL << 32) - 1;
> >  
> > +	if (dev->dma_range_map) {
> > +		dev_dbg(dev, "dma_range_map already set\n");
> > +		return;
> > +	}
> > +
> >  	ret = acpi_dma_get_range(dev, &map);
> >  	if (!ret && map) {
> >  		end = dma_range_map_max(map);
> > diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
> > index 9f4efa8f75a6..fb1fe9f3b1a3 100644
> > --- a/drivers/acpi/scan.c
> > +++ b/drivers/acpi/scan.c
> > @@ -1632,13 +1632,6 @@ static int acpi_iommu_configure_id(struct
> > device *dev, const u32 *id_in)
> >  		err = viot_iommu_configure(dev);
> >  	mutex_unlock(&iommu_probe_device_lock);
> >  
> > -	/*
> > -	 * If we have reason to believe the IOMMU driver missed
> > the initial
> > -	 * iommu_probe_device() call for dev, replay it to get
> > things in order.
> > -	 */
> > -	if (!err && dev->bus)
> > -		err = iommu_probe_device(dev);
> > -
> >  	return err;
> >  }
> >  
> > diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
> > index 8ef259b4d037..71482d639a6d 100644
> > --- a/drivers/amba/bus.c
> > +++ b/drivers/amba/bus.c
> > @@ -364,7 +364,8 @@ static int amba_dma_configure(struct device
> > *dev)
> >  		ret = acpi_dma_configure(dev, attr);
> >  	}
> >  
> > -	if (!ret && !drv->driver_managed_dma) {
> > +	/* @drv may not be valid when we're called from the IOMMU
> > layer */
> > +	if (!ret && dev->driver && !drv->driver_managed_dma) {
> >  		ret = iommu_device_use_default_domain(dev);
> >  		if (ret)
> >  			arch_teardown_dma_ops(dev);
> > diff --git a/drivers/base/platform.c b/drivers/base/platform.c
> > index 6f2a33722c52..1813cfd0c4bd 100644
> > --- a/drivers/base/platform.c
> > +++ b/drivers/base/platform.c
> > @@ -1451,7 +1451,8 @@ static int platform_dma_configure(struct
> > device *dev)
> >  		attr =
> > acpi_get_dma_attr(to_acpi_device_node(fwnode));
> >  		ret = acpi_dma_configure(dev, attr);
> >  	}
> > -	if (ret || drv->driver_managed_dma)
> > +	/* @drv may not be valid when we're called from the IOMMU
> > layer */
> > +	if (ret || !dev->driver || drv->driver_managed_dma)
> >  		return ret;
> >  
> >  	ret = iommu_device_use_default_domain(dev);
> > diff --git a/drivers/bus/fsl-mc/fsl-mc-bus.c b/drivers/bus/fsl-
> > mc/fsl-mc-bus.c
> > index d1f3d327ddd1..a8be8cf246fb 100644
> > --- a/drivers/bus/fsl-mc/fsl-mc-bus.c
> > +++ b/drivers/bus/fsl-mc/fsl-mc-bus.c
> > @@ -153,7 +153,8 @@ static int fsl_mc_dma_configure(struct device
> > *dev)
> >  	else
> >  		ret = acpi_dma_configure_id(dev, DEV_DMA_COHERENT,
> > &input_id);
> >  
> > -	if (!ret && !mc_drv->driver_managed_dma) {
> > +	/* @mc_drv may not be valid when we're called from the
> > IOMMU layer */
> > +	if (!ret && dev->driver && !mc_drv->driver_managed_dma) {
> >  		ret = iommu_device_use_default_domain(dev);
> >  		if (ret)
> >  			arch_teardown_dma_ops(dev);
> > diff --git a/drivers/cdx/cdx.c b/drivers/cdx/cdx.c
> > index c573ed2ee71a..780fb0c4adba 100644
> > --- a/drivers/cdx/cdx.c
> > +++ b/drivers/cdx/cdx.c
> > @@ -360,7 +360,8 @@ static int cdx_dma_configure(struct device
> > *dev)
> >  		return ret;
> >  	}
> >  
> > -	if (!ret && !cdx_drv->driver_managed_dma) {
> > +	/* @cdx_drv may not be valid when we're called from the
> > IOMMU layer */
> > +	if (!ret && dev->driver && !cdx_drv->driver_managed_dma) {
> >  		ret = iommu_device_use_default_domain(dev);
> >  		if (ret)
> >  			arch_teardown_dma_ops(dev);
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index a3b45b84f42b..1cec7074367a 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -414,9 +414,21 @@ static int iommu_init_device(struct device
> > *dev)
> >  	if (!dev_iommu_get(dev))
> >  		return -ENOMEM;
> >  	/*
> > -	 * For FDT-based systems and ACPI IORT/VIOT, drivers
> > register IOMMU
> > -	 * instances with non-NULL fwnodes, and client devices
> > should have been
> > -	 * identified with a fwspec by this point. Otherwise, we
> > can currently
> > +	 * For FDT-based systems and ACPI IORT/VIOT, the common
> > firmware parsing
> > +	 * is buried in the bus dma_configure path. Properly
> > unpicking that is
> > +	 * still a big job, so for now just invoke the whole
> > thing. The device
> > +	 * already having a driver bound means dma_configure has
> > already run and
> > +	 * either found no IOMMU to wait for, or we're in its
> > replay call right
> > +	 * now, so either way there's no point calling it again.
> > +	 */
> > +	if (!dev->driver && dev->bus->dma_configure) {
> > +		mutex_unlock(&iommu_probe_device_lock);
> > +		dev->bus->dma_configure(dev);
> > +		mutex_lock(&iommu_probe_device_lock);
> > +	}
> > +	/*
> > +	 * At this point, relevant devices either now have a
> > fwspec which will
> > +	 * match ops registered with a non-NULL fwnode, or we can
> > reasonably
> >  	 * assume that only one of Intel, AMD, s390, PAMU or
> > legacy SMMUv2 can
> >  	 * be present, and that any of their registered instances
> > has suitable
> >  	 * ops for probing, and thus cheekily co-opt the same
> > mechanism.
> > @@ -426,6 +438,12 @@ static int iommu_init_device(struct device
> > *dev)
> >  		ret = -ENODEV;
> >  		goto err_free;
> >  	}
> > +	/*
> > +	 * And if we do now see any replay calls, they would
> > indicate someone
> > +	 * misusing the dma_configure path outside bus code.
> > +	 */
> > +	if (dev->driver)
> > +		dev_WARN(dev, "late IOMMU probe at driver bind,
> > something fishy here!\n");
> >  
> >  	if (!try_module_get(ops->owner)) {
> >  		ret = -EINVAL;
> > diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
> > index e10a68b5ffde..6b989a62def2 100644
> > --- a/drivers/iommu/of_iommu.c
> > +++ b/drivers/iommu/of_iommu.c
> > @@ -155,7 +155,12 @@ int of_iommu_configure(struct device *dev,
> > struct device_node *master_np,
> >  		dev_iommu_free(dev);
> >  	mutex_unlock(&iommu_probe_device_lock);
> >  
> > -	if (!err && dev->bus)
> > +	/*
> > +	 * If we're not on the iommu_probe_device() path (as
> > indicated by the
> > +	 * initial dev->iommu) then try to simulate it. This
> > should no longer
> > +	 * happen unless of_dma_configure() is being misused
> > outside bus code.
> > +	 */
> > +	if (!err && dev->bus && !dev_iommu_present)
> >  		err = iommu_probe_device(dev);
> >  
> >  	if (err && err != -EPROBE_DEFER)
> > diff --git a/drivers/of/device.c b/drivers/of/device.c
> > index edf3be197265..5053e5d532cc 100644
> > --- a/drivers/of/device.c
> > +++ b/drivers/of/device.c
> > @@ -99,6 +99,11 @@ int of_dma_configure_id(struct device *dev,
> > struct device_node *np,
> >  	bool coherent, set_map = false;
> >  	int ret;
> >  
> > +	if (dev->dma_range_map) {
> > +		dev_dbg(dev, "dma_range_map already set\n");
> > +		goto skip_map;
> > +	}
> > +
> >  	if (np == dev->of_node)
> >  		bus_np = __of_get_dma_parent(np);
> >  	else
> > @@ -119,7 +124,7 @@ int of_dma_configure_id(struct device *dev,
> > struct device_node *np,
> >  		end = dma_range_map_max(map);
> >  		set_map = true;
> >  	}
> > -
> > +skip_map:
> >  	/*
> >  	 * If @dev is expected to be DMA-capable then the bus code
> > that created
> >  	 * it should have initialised its dma_mask pointer by this
> > point. For
> > diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> > index f57ea36d125d..4468b7327cab 100644
> > --- a/drivers/pci/pci-driver.c
> > +++ b/drivers/pci/pci-driver.c
> > @@ -1653,7 +1653,8 @@ static int pci_dma_configure(struct device
> > *dev)
> >  
> >  	pci_put_host_bridge_device(bridge);
> >  
> > -	if (!ret && !driver->driver_managed_dma) {
> > +	/* @driver may not be valid when we're called from the
> > IOMMU layer */
> > +	if (!ret && dev->driver && !driver->driver_managed_dma) {
> >  		ret = iommu_device_use_default_domain(dev);
> >  		if (ret)
> >  			arch_teardown_dma_ops(dev);
> > -- 
> > 2.39.2.101.g768bb238c484.dirty
> > 
> 




More information about the linux-arm-kernel mailing list