[PATCH 1/6] pmdomain: rockchip: fix rockchip_pd_power error handling

Peter Geis pgwipeout at gmail.com
Tue Dec 10 12:12:36 PST 2024


On Tue, Dec 10, 2024 at 3:18 AM Dragan Simic <dsimic at manjaro.org> wrote:
>
> Hello Peter,
>
> On 2024-12-10 02:30, Peter Geis wrote:
> > The calls rockchip_pd_power makes to rockchip_pmu_set_idle_request lack
> > any return error handling, causing device drivers to incorrectly
> > believe
> > the hardware idle requests succeed when they may have failed. This
> > leads
> > to software possibly accessing hardware that is powered off and the
> > subsequent SError panic that follows.
> >
> > Add error checking and return errors to the calling function to prevent
> > such crashes.
> >
> > gst-launch-1.0 videotestsrc num-buffers=2000 ! v4l2jpegenc ! fakesink
> > Setting pipeline to PAUSED ...er-x64
> > Pipeline is PREROLLING ...
> > Redistribute latency...
> > rockchip-pm-domain ff100000.syscon:power-controller: failed to get ack
> > on
> > domain 'hevc', val=0x98260, idle = 0
> > SError Interrupt on CPU2, code 0x00000000bf000002 -- SError
> > CPU: 2 UID: 0 PID: 804 Comm: videotestsrc0:s Not tainted 6.12.0-rc5+
> > #54
> > Hardware name: Firefly roc-rk3328-cc (DT)
> > pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > pc : rockchip_vpu2_jpeg_enc_run+0x168/0xbc8
> > lr : device_run+0xb0/0x128
> > sp : ffff800082143a20
> > x29: ffff800082143a20 x28: 0000000000000140 x27: 0000000000000000
> > x26: ffff582c47a313e8 x25: ffff582c53e95000 x24: ffff582c53e92800
> > x23: ffff582c5bbe0000 x22: 0000000000000000 x21: ffff582c47a31080
> > x20: ffffa0d78cfa4168 x19: ffffa0d78cfa4168 x18: ffffb755b0519000
> > x17: 000000040044ffff x16: 00500072b5503510 x15: a7a6a5a4a3a29a99
> > x14: 989796959493928a x13: 0000000051eb851f x12: 00000000000000ff
> > x11: ffffa0d78d812880 x10: ffffa0d78d7fbca0 x9 : 000000000000003f
> > x8 : 0000000000000063 x7 : 000000000000003f x6 : 0000000000000040
> > x5 : ffff80008010d000 x4 : ffffa0d78cfa4168 x3 : ffffa0d78cfbfdd8
> > x2 : ffff80008010d0f4 x1 : 0000000000000020 x0 : 0000000000000140
> > Kernel panic - not syncing: Asynchronous SError Interrupt
> > CPU: 2 UID: 0 PID: 804 Comm: videotestsrc0:s Not tainted 6.12.0-rc5+
> > #54
> > Hardware name: Firefly roc-rk3328-cc (DT)
> > Call trace:
> > dump_backtrace+0xa0/0x128
> > show_stack+0x20/0x38
> > dump_stack_lvl+0xc8/0xf8
> > dump_stack+0x18/0x28
> > panic+0x3ec/0x428
> > nmi_panic+0x48/0xa0
> > arm64_serror_panic+0x6c/0x88
> > do_serror+0x30/0x70
> > el1h_64_error_handler+0x38/0x60
> > el1h_64_error+0x7c/0x80
> > rockchip_vpu2_jpeg_enc_run+0x168/0xbc8
> > device_run+0xb0/0x128
> > v4l2_m2m_try_run+0xac/0x230
> > v4l2_m2m_ioctl_streamon+0x70/0x90
> > v4l_streamon+0x2c/0x40
> > __video_do_ioctl+0x194/0x400
> > video_usercopy+0x10c/0x808
> > video_ioctl2+0x20/0x80
> > v4l2_ioctl+0x48/0x70
> > __arm64_sys_ioctl+0xb0/0x100
> > invoke_syscall+0x50/0x120
> > el0_svc_common.constprop.0+0x48/0xf0
> > do_el0_svc+0x24/0x38
> > el0_svc+0x38/0x100
> > el0t_64_sync_handler+0xc0/0xc8
> > el0t_64_sync+0x1a8/0x1b0
> > SMP: stopping secondary CPUs
> > Kernel Offset: 0x20d70c000000 from 0xffff800080000000
> > PHYS_OFFSET: 0xffffa7d3c0000000
> > CPU features: 0x00,00000090,00200000,0200421b
> > Memory Limit: none
> > ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
> >
> > Fixes: 7c696693a4f5 ("soc: rockchip: power-domain: Add power domain
> > driver")
> > Signed-off-by: Peter Geis <pgwipeout at gmail.com>
>
> This patch is obviously correct, because not checking what
> rockchip_pmu_set_idle_request() returns was simply wrong.
> Thanks for the patch!
>
> Though, shouldn't we improve further the way proper error
> handling is performed in rockchip_do_pmu_set_power_domain(),
> by "rolling back" what rockchip_do_pmu_set_power_domain()
> did after powering up fails?

Eventually, but the reality is if we hit this path the hardware is
already broken. I wanted to provide a fix with the least amount of
risk of breaking things further. I'm open to suggestions for further
improvements here.

Current behavior with this patch with the example above causes
gstreamer to wait indefinitely. I'll trace the return path and see if
returning an error other than -ETIMEDOUT triggers more robust
handling.

>
> > ---
> >
> >  drivers/pmdomain/rockchip/pm-domains.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/pmdomain/rockchip/pm-domains.c
> > b/drivers/pmdomain/rockchip/pm-domains.c
> > index cb0f93800138..57e8fa25d2bd 100644
> > --- a/drivers/pmdomain/rockchip/pm-domains.c
> > +++ b/drivers/pmdomain/rockchip/pm-domains.c
> > @@ -590,14 +590,18 @@ static int rockchip_pd_power(struct
> > rockchip_pm_domain *pd, bool power_on)
> >                       rockchip_pmu_save_qos(pd);
> >
> >                       /* if powering down, idle request to NIU first */
> > -                     rockchip_pmu_set_idle_request(pd, true);
> > +                     ret = rockchip_pmu_set_idle_request(pd, true);
> > +                     if (ret < 0)
> > +                             return ret;
> >               }
> >
> >               rockchip_do_pmu_set_power_domain(pd, power_on);
> >
> >               if (power_on) {
> >                       /* if powering up, leave idle mode */
> > -                     rockchip_pmu_set_idle_request(pd, false);
> > +                     ret = rockchip_pmu_set_idle_request(pd, false);
> > +                     if (ret < 0)
> > +                             return ret;
> >
> >                       rockchip_pmu_restore_qos(pd);
> >               }



More information about the Linux-rockchip mailing list