[PATCH 1/6] pmdomain: rockchip: fix rockchip_pd_power error handling
Dragan Simic
dsimic at manjaro.org
Tue Dec 10 18:54:22 PST 2024
Hello Peter,
On 2024-12-10 21:12, Peter Geis wrote:
> On Tue, Dec 10, 2024 at 3:18 AM Dragan Simic <dsimic at manjaro.org>
> wrote:
>> On 2024-12-10 02:30, Peter Geis wrote:
>> > The calls rockchip_pd_power makes to rockchip_pmu_set_idle_request lack
>> > any return error handling, causing device drivers to incorrectly
>> > believe
>> > the hardware idle requests succeed when they may have failed. This
>> > leads
>> > to software possibly accessing hardware that is powered off and the
>> > subsequent SError panic that follows.
>> >
>> > Add error checking and return errors to the calling function to prevent
>> > such crashes.
>> >
>> > gst-launch-1.0 videotestsrc num-buffers=2000 ! v4l2jpegenc ! fakesink
>> > Setting pipeline to PAUSED ...er-x64
>> > Pipeline is PREROLLING ...
>> > Redistribute latency...
>> > rockchip-pm-domain ff100000.syscon:power-controller: failed to get ack
>> > on
>> > domain 'hevc', val=0x98260, idle = 0
>> > SError Interrupt on CPU2, code 0x00000000bf000002 -- SError
>> > CPU: 2 UID: 0 PID: 804 Comm: videotestsrc0:s Not tainted 6.12.0-rc5+
>> > #54
>> > Hardware name: Firefly roc-rk3328-cc (DT)
>> > pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> > pc : rockchip_vpu2_jpeg_enc_run+0x168/0xbc8
>> > lr : device_run+0xb0/0x128
>> > sp : ffff800082143a20
>> > x29: ffff800082143a20 x28: 0000000000000140 x27: 0000000000000000
>> > x26: ffff582c47a313e8 x25: ffff582c53e95000 x24: ffff582c53e92800
>> > x23: ffff582c5bbe0000 x22: 0000000000000000 x21: ffff582c47a31080
>> > x20: ffffa0d78cfa4168 x19: ffffa0d78cfa4168 x18: ffffb755b0519000
>> > x17: 000000040044ffff x16: 00500072b5503510 x15: a7a6a5a4a3a29a99
>> > x14: 989796959493928a x13: 0000000051eb851f x12: 00000000000000ff
>> > x11: ffffa0d78d812880 x10: ffffa0d78d7fbca0 x9 : 000000000000003f
>> > x8 : 0000000000000063 x7 : 000000000000003f x6 : 0000000000000040
>> > x5 : ffff80008010d000 x4 : ffffa0d78cfa4168 x3 : ffffa0d78cfbfdd8
>> > x2 : ffff80008010d0f4 x1 : 0000000000000020 x0 : 0000000000000140
>> > Kernel panic - not syncing: Asynchronous SError Interrupt
>> > CPU: 2 UID: 0 PID: 804 Comm: videotestsrc0:s Not tainted 6.12.0-rc5+
>> > #54
>> > Hardware name: Firefly roc-rk3328-cc (DT)
>> > Call trace:
>> > dump_backtrace+0xa0/0x128
>> > show_stack+0x20/0x38
>> > dump_stack_lvl+0xc8/0xf8
>> > dump_stack+0x18/0x28
>> > panic+0x3ec/0x428
>> > nmi_panic+0x48/0xa0
>> > arm64_serror_panic+0x6c/0x88
>> > do_serror+0x30/0x70
>> > el1h_64_error_handler+0x38/0x60
>> > el1h_64_error+0x7c/0x80
>> > rockchip_vpu2_jpeg_enc_run+0x168/0xbc8
>> > device_run+0xb0/0x128
>> > v4l2_m2m_try_run+0xac/0x230
>> > v4l2_m2m_ioctl_streamon+0x70/0x90
>> > v4l_streamon+0x2c/0x40
>> > __video_do_ioctl+0x194/0x400
>> > video_usercopy+0x10c/0x808
>> > video_ioctl2+0x20/0x80
>> > v4l2_ioctl+0x48/0x70
>> > __arm64_sys_ioctl+0xb0/0x100
>> > invoke_syscall+0x50/0x120
>> > el0_svc_common.constprop.0+0x48/0xf0
>> > do_el0_svc+0x24/0x38
>> > el0_svc+0x38/0x100
>> > el0t_64_sync_handler+0xc0/0xc8
>> > el0t_64_sync+0x1a8/0x1b0
>> > SMP: stopping secondary CPUs
>> > Kernel Offset: 0x20d70c000000 from 0xffff800080000000
>> > PHYS_OFFSET: 0xffffa7d3c0000000
>> > CPU features: 0x00,00000090,00200000,0200421b
>> > Memory Limit: none
>> > ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
>> >
>> > Fixes: 7c696693a4f5 ("soc: rockchip: power-domain: Add power domain
>> > driver")
>> > Signed-off-by: Peter Geis <pgwipeout at gmail.com>
>>
>> This patch is obviously correct, because not checking what
>> rockchip_pmu_set_idle_request() returns was simply wrong.
>> Thanks for the patch!
>>
>> Though, shouldn't we improve further the way proper error
>> handling is performed in rockchip_do_pmu_set_power_domain(),
>> by "rolling back" what rockchip_do_pmu_set_power_domain()
>> did after powering up fails?
>
> Eventually, but the reality is if we hit this path the hardware is
> already broken. I wanted to provide a fix with the least amount of
> risk of breaking things further. I'm open to suggestions for further
> improvements here.
Perhaps a good approach, at the moment, would be to add no more
code for the "rollback", but to add a TODO note about the need for
that addition. That way we'd keep the amount of changes at the
minimum, to hopefully not cause any regressions, while leaving
a note for the second round of improvements in the future.
> Current behavior with this patch with the example above causes
> gstreamer to wait indefinitely. I'll trace the return path and see if
> returning an error other than -ETIMEDOUT triggers more robust
> handling.
Sounds like a plan to me. Thanks for working on this!
>>
>> > ---
>> >
>> > drivers/pmdomain/rockchip/pm-domains.c | 8 ++++++--
>> > 1 file changed, 6 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/drivers/pmdomain/rockchip/pm-domains.c
>> > b/drivers/pmdomain/rockchip/pm-domains.c
>> > index cb0f93800138..57e8fa25d2bd 100644
>> > --- a/drivers/pmdomain/rockchip/pm-domains.c
>> > +++ b/drivers/pmdomain/rockchip/pm-domains.c
>> > @@ -590,14 +590,18 @@ static int rockchip_pd_power(struct
>> > rockchip_pm_domain *pd, bool power_on)
>> > rockchip_pmu_save_qos(pd);
>> >
>> > /* if powering down, idle request to NIU first */
>> > - rockchip_pmu_set_idle_request(pd, true);
>> > + ret = rockchip_pmu_set_idle_request(pd, true);
>> > + if (ret < 0)
>> > + return ret;
>> > }
>> >
>> > rockchip_do_pmu_set_power_domain(pd, power_on);
>> >
>> > if (power_on) {
>> > /* if powering up, leave idle mode */
>> > - rockchip_pmu_set_idle_request(pd, false);
>> > + ret = rockchip_pmu_set_idle_request(pd, false);
>> > + if (ret < 0)
>> > + return ret;
>> >
>> > rockchip_pmu_restore_qos(pd);
>> > }
More information about the linux-arm-kernel
mailing list