Issues with DRM_ACCEL_ROCKET (Oops, power-domain, probe-ordering)

Sun Nov 30 22:24:59 PST 2025

Hi Quentin,

On 11/29/2025 2:41 AM, Quentin Schulz wrote:
> Hi all,
> 
> I've finally found the time to test the RK3588 NPU (aka Rocket, upstream) on our products and I've been hit with Oopses when loosely following the instructions from https://docs.mesa3d.org/teflon.html. After debugging for a while this is often (not always!) reproducible when DRM_ACCEL_ROCKET=y (but not when built as a module!). The fact it's "not always" is a dead giveaway timing or probe order is involved, more later on that.
> 

This looks like a power domain problem. What would happen if you tried
enabling the regulator for the NPU power domain in U-Boot?

Or did you properly configure the corresponding regulator supply in
the power domain?

> I don't think the Oops is important as I believe it's a symptom of other issues, so I won't post it.
> 
> I'll list the issue I've identified so far, some seem easier to fix than others (the biggest issue of course being the hardest to fix).
> 
> 1. https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/accel/rocket/rocket_core.c#L79
> 
> The error path after a failed pm_runtime_resume_and_get() probably should be calling rocket_core_fini(core) and not rocket_job_fini() (the former calls the latter) as to unwind everything that's been done in rocket_core_init() before this failed.
> 
> The PM core complains about unbalanced something when the core fails to init and is retried later.
> 
> I suggest the following:
> 
> """
> diff --git a/drivers/accel/rocket/rocket_core.c b/drivers/accel/rocket/rocket_core.c
> index abe7719c1db46..95a732ab418c3 100644
> --- a/drivers/accel/rocket/rocket_core.c
> +++ b/drivers/accel/rocket/rocket_core.c
> @@ -76,7 +76,7 @@ int rocket_core_init(struct rocket_core *core)
> 
>      err = pm_runtime_resume_and_get(dev);
>      if (err) {
> -        rocket_job_fini(core);
> +        rocket_core_fini(core);
>          return err;
>      }
> 
> """
> 
> I can send a patch for that.
> 
> 2. rocket_probe() (https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/accel/rocket/rocket_drv.c#L159) can result in out-of-bounds access when rocket_core_init() fails as rdev is a global variable (so rdev->num_cores persists between tries). Considering rdev->cores is allocated based on the number of available nodes in the DTB with the rknn compatible, if rocket_probe() is called more than the number of available nodes, rdev->cores[core] will result in out-of-bounds access.
> 
> I believe we could do:
> 
> """
> diff --git a/drivers/accel/rocket/rocket_drv.c b/drivers/accel/rocket/rocket_drv.c
> index 5c0b63f0a8f00..ad602ebd17c44 100644
> --- a/drivers/accel/rocket/rocket_drv.c
> +++ b/drivers/accel/rocket/rocket_drv.c
> @@ -158,6 +158,8 @@ static const struct drm_driver rocket_drm_driver = {
> 
>  static int rocket_probe(struct platform_device *pdev)
>  {
> +    int ret;
> +
>      if (rdev == NULL) {
>          /* First core probing, initialize DRM device. */
>          rdev = rocket_device_init(drm_dev, &rocket_drm_driver);
> @@ -177,7 +179,15 @@ static int rocket_probe(struct platform_device *pdev)
> 
>      rdev->num_cores++;
> 
> -    return rocket_core_init(&rdev->cores[core]);
> +    ret = rocket_core_init(&rdev->cores[core]);
> +    if (ret) {
> +        rdev->num_cores--;
> +        if (rdev->num_cores == 0) {
> +            rocket_device_fini(rdev);
> +        }
> +    }
> +
> +    return ret;
>  }
> 
>  static void rocket_remove(struct platform_device *pdev)
> 
> """
> 

It seems hacky. I'm not sure whether this would be better
implemented as a component driver.

> I can send a patch for that.
> 
> 3. I believe (but I may be wrong) that core 0 is special (do we need to schedule through core 0 maybe?). There's a hardcoded fetch of the device associated with core 0 via rocket_iommu_domain_create(rdev->cores[0].dev); (https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/accel/rocket/rocket_drv.c#L88). The issue is that this assumes that core 0 is always rdev->cores[0], which may not be true as it depends on the probe order (which is not guaranteed), which can be exacerbated by core 0 (HW index) probe being deferred for some reason. Is it necessary for core 0 to be the one initializing the DRM device or could we simply use any core for that (just the first to probe, whichever it actually is HW-wise)?

I think this is because multiple cores share the same IOMMU device.

> 
> If there's a need to know which of the three cores is core 0, we can have a vendor property for it or maybe aliases for the cores? We can also do a check on the base register address of the core in DT and then do a mapping in the driver. Any other suggestion?
> 
> 4. The actual issue I'm unable to suggest something for comes now. The following is all happening in the context of DRM_ACCEL_ROCKET being built-in (that is DRM_ACCEL_ROCKET=y). You can use aarch64's default defconfig, then DRM=y, DRM_ACCEL=y and DRM_ACCEL_ROCKET=y via menuconfig. Remember that this is not 100% reproducible (more often happening than not).
> 
> Points 1, 2 and 3 only are an issue if you hit some error-path during the probe of one of the cores, which is triggered by this point 4 here.
> 
> rknn_core_0 depends on rknn_mmu_0 as highlighted by the iommus phandle property. rknn_mmu_0 depends on RK3588_PD_NPUTOP power-domain. The parent domain (RK3588_PD_NPU) of RK3588_PD_NPUTOP power-domain has a domain-supply on a regulator on i2c (i2c0 at 42 for Jaguar, patches on the ML at https://lore.kernel.org/linux-rockchip/20250812085213.1071106-2-heiko@sntech.de/). When the rocket driver calls platform_driver_register() in its module_init() function, the kernel core will try to probe the associated device, via driver_probe_device(), then __driver_probe_device() and more specifically pm_runtime_get_suppliers(dev) in there (https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/base/dd.c#L793). Then this calls pm_runtime_get_sync() on all links, c.f. https://elixir.bootlin.com/linux/v6.18-rc7/source/drivers/base/power/runtime.c#L1914. Note that pm_runtime_get_suppliers() returns void. One of the links is the rknn_mmu_0.
> 
> Follow the next series of function calls, pm_runtime_get_sync() on rknn_mmu_0, rpm_resume(), rpm_callback(), __rpm_callback(). This then calls cb(). cb() is set via RPM_GET_CALLBACK(dev, runtime_resume). Due to rknn_mmu_0 device being probed with dev_pm_domain_attach() attaching the power-domain device to iommu via the pm_domain member, genpd_runtime_resume() (which goes to rockchip_pd_power_on()) is actually called. This then calls rockchip_pd_regulator_enable() which returns -EPROBE_DEFER. Going up in the call stack up to rpm_callback() you can see that retval = __rpm_callback(cb, dev); now is EPROBE_DEFER and thus, dev->power.runtime_error = EPROBE_DEFER at the end of the function.
> 
> Then once pm_runtime_resume_and_get(dev); is called in rocket_core_init() during probe of the rocket device, it'll call into rpm_get_suppliers() in __rpm_callback() which in turn will call pm_runtime_get_sync() on rknn_mmu_0 which will go up to rpm_resume() where dev->power.runtime_error will be checked and if non-zero will return -EINVAL, effectively eating -EPROBE_DEFER and failing the probe of the rocket device instead of deferring the probe.
> 
> Applying the below patch:
> 
> """
> diff --git a/drivers/accel/rocket/rocket_core.c b/drivers/accel/rocket/rocket_core.c
> index abe7719c1db46..c4692189bbfc6 100644
> --- a/drivers/accel/rocket/rocket_core.c
> +++ b/drivers/accel/rocket/rocket_core.c
> @@ -72,22 +72,9 @@ int rocket_core_init(struct rocket_core *core)
>       */
>      pm_runtime_set_autosuspend_delay(dev, 50);
> 
> +    pm_runtime_set_suspended(dev);
>      pm_runtime_enable(dev);
> 
> -    err = pm_runtime_resume_and_get(dev);
> -    if (err) {
> -        rocket_job_fini(core);
> -        return err;
> -    }
> -
> -    version = rocket_pc_readl(core, VERSION);
> -    version += rocket_pc_readl(core, VERSION_NUM) & 0xffff;
> -
> -    pm_runtime_mark_last_busy(dev);
> -    pm_runtime_put_autosuspend(dev);
> -
> -    dev_info(dev, "Rockchip NPU core %d version: %d\n", core->index, version);
> -
>      return 0;
>  }
> 
> """
> 
> seems to make it possible to run mesa's example. If pm_runtime_set_suspended(dev); isn't added, we'll only ever get job timeouts.
> 
> Note that I'm pretty sure this isn't the proper way to do it but wanted to give some of my debugging attempts at that.
> 
> I'm not sure how much further I can go and specifically I have no clue how to handle that properly. Please also verify all the claims I'm making, I'm not entirely sure I got every call stack right, it is quite convoluted and I don't have experience with those frameworks.
> 
> Does anyone have a clue how we can deal with that?
> 
> The driver built as a module doesn't have this issue because the regulator has enough time to probe so that PD doesn't return EPROBE_DEFER when the rocket device gets probed once reaching userspace (where the module is). I'm not entirely sure if this could be reproduced by having the regulator driver as a module too for example?
> 
> Cheers,
> Quentin
> 
> _______________________________________________
> Linux-rockchip mailing list
> Linux-rockchip at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-rockchip

-- 
Best, 
Chaoyi