[PATCH v2] iommu/mediatek: Fix NULL pointer deference in mtk_iommu_device_group
Robin Murphy
robin.murphy at arm.com
Thu Apr 10 04:14:48 PDT 2025
On 08/04/2025 5:46 am, Chen-Yu Tsai wrote:
> On Mon, Apr 7, 2025 at 8:38 PM Robin Murphy <robin.murphy at arm.com> wrote:
>>
>> On 2025-04-07 6:17 am, Chen-Yu Tsai wrote:
>>> Hi,
>>>
>>> On Thu, Apr 3, 2025 at 6:24 PM Louis-Alexis Eyraud
>>> <louisalexis.eyraud at collabora.com> wrote:
>>>>
>>>> Currently, mtk_iommu calls during probe iommu_device_register before
>>>> the hw_list from driver data is initialized. Since iommu probing issue
>>>> fix, it leads to NULL pointer dereference in mtk_iommu_device_group when
>>>> hw_list is accessed with list_first_entry (not null safe).
>>>>
>>>> So, change the call order to ensure iommu_device_register is called
>>>> after the driver data are initialized.
>>>>
>>>> Fixes: 9e3a2a643653 ("iommu/mediatek: Adapt sharing and non-sharing pgtable case")
>>>> Fixes: bcb81ac6ae3c ("iommu: Get DT/ACPI parsing into the proper probe path")
>>>> Reviewed-by: Yong Wu <yong.wu at mediatek.com>
>>>> Tested-by: Chen-Yu Tsai <wenst at chromium.org> # MT8183 Juniper, MT8186 Tentacruel
>>>> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno at collabora.com>
>>>> Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno at collabora.com>
>>>> Signed-off-by: Louis-Alexis Eyraud <louisalexis.eyraud at collabora.com>
>>>> ---
>>>> This patch fixes a NULL pointer dereference that occurs during the
>>>> mtk_iommu driver probe and observed at least on several Mediatek Genio boards:
>>>> ```
>>>> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
>>>
>>> This is a reminder to please land this and send to Linus ASAP.
>>>
>>> This fixes the v6.15-rc1 kernel on all the MediaTek Chromebook platforms,
>>> except for MT8188, which seems to have another issue in iommu_get_dma_domain()
>>> used from the DRM driver:
>>>
>>> Disabling lock debugging due to kernel taint
>>> Unable to handle kernel NULL pointer dereference at virtual
>>> address 0000000000000158
>>
>> From the offset and the stacktrace code dump, this would appear to be
>> the dereference of dev->iommu_group->default_domain, when
>> dev->iommu_group is NULL (and CONFIG_DEBUG_LOCK_ALLOC makes the mutex
>> really big). Which is a bit weird, as to get into iommu-dma at all in
>> that state would suggest that whatever device this is has been removed
>> and had its group torn down again after iommu_setup_dma_ops() has run...
>> but either way that implies the DRM driver is passing an arbitrary
>> device to the DMA API without making sure it's actualy valid.
>>
>> Trying to trace the provenance of dma_dev from mtk_gem_create() back
>> through the rest of the driver is quite the rabbit-hole, but it seems
>> like in at least one case it can lead back to an
>> of_find_device_by_node() in ovl_adaptor_comp_init(), which definitely
>> looks sufficiently sketchy.
>
> It kind of makes sense since the "display controller" is composed of
> many individual hardware blocks. The struct device tied to the DRM
> driver is more or less just a place holder. Only the first block,
> either the OVL (overlay compositing engine) or RDMA (scanout engine)
> accesses memory, so I think it makes sense to use that as the dma_dev.
I'm not disputing whether the choice of dma_dev is semantically
appropriate, I'm just saying that the method of pulling a struct device
reference out of the DT topology shows no *obvious* guarantee that that
specific device will have a driver bound and be validly configured
before that dma_dev can be used via any other path. Especially when it's
all happening off the back of another fake device created by the fake
DRM device itself.
Maybe there's some hidden magic in all the component stuff which makes
it work out fine, I don't know. This was just an observation since I
went looking for potential bugs and found something which at first
glance *looks* rather fragile, compared to, say, if mtk_mdp_rdma's own
probe() or bind() were to directly set itself as the DMA device.
> With some more logs, I did find something else fishy. Here the IOMMU
> for the second display pipeline fails to probe:
>
> mtk-iommu 1c028000.iommu: error -EINVAL: Failed to register IOMMU
> mtk-iommu 1c028000.iommu: probe with driver mtk-iommu failed with error -22
>
> Then later on, deferred probe times out, and the display pipeline is
> brought up regardless:
>
> mediatek-disp-ovl 1c000000.ovl: deferred probe timeout, ignoring dependency
> mediatek-disp-ovl 1c000000.ovl: Adding to IOMMU failed: -110
> mediatek-disp-rdma 1c002000.rdma: deferred probe timeout, ignoring
> dependency
> mediatek-disp-rdma 1c002000.rdma: Adding to IOMMU failed: -110
> (repeats for all the individual components of the display pipeline)
> mediatek-drm mediatek-drm.16.auto: bound 1c000000.ovl (ops
> mtk_disp_ovl_component_ops)
> mediatek-drm mediatek-drm.16.auto: bound 1c002000.rdma (ops
> mtk_disp_rdma_component_ops)
> (repeats for all the individual components of the display pipeline)
> mediatek-drm mediatek-drm.16.auto: DMA device is 1c000000.ovl
> [drm] Initialized mediatek 1.0.0 for mediatek-drm.16.auto on minor 1
>
> And all without a functional IOMMU.
Ah, that I was not expecting, and it does indeed explain how things get
into that state.
> So I think this brings up two more questions:
>
> 1. Why is the IOMMU failing to probe?
> 2. Why is the core code still going the IOMMU DMA alloc path if there
> is no usable IOMMU?
>
> I'll look into the first question first. Insights welcome for the second
> one.
I had a think about it, and although there's very much a historical mess
in this area, I'm inclined to agree that we could do better. Patch sent.
Cheers,
Robin.
More information about the Linux-mediatek
mailing list