[REGRESSION] amdgpu: async system error exception from hdp_v5_0_flush_hdp()
Christian König
christian.koenig at amd.com
Wed Apr 16 04:44:13 PDT 2025
On 15.04.25 at 20:28, Alexey Klimov wrote:
> #regzbot introduced: v6.12..v6.13
>
> I use an RX6600 on an arm64 Orion O6 board and it seems that amdgpu is broken on recent kernels; it fails on boot:
Well, in general we have already had tons of problems with low-end ARM64 boards. So the first question is: is that board SBSA certified?
If not, then the chances of that board actually working correctly are unfortunately very low.
> [drm] amdgpu: 7886M of GTT memory ready.
> [drm] GART: num cpu pages 131072, num gpu pages 131072
> SError Interrupt on CPU11, code 0x00000000be000011 -- SError
Any idea what that error code means?
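For reference, the value printed after "code" is the ESR_ELx syndrome register, so it can be split into its architectural fields by hand. The sketch below is only an illustration under stated assumptions: the function name decode_serror_esr and the dictionary layout are made up for this example, the bit positions follow the ESR_ELx description in the Arm Architecture Reference Manual, and the AET field is only meaningful if the CPU implements the RAS extension and DFSC reports an asynchronous SError.

```python
# Minimal sketch: split an arm64 ESR_ELx value into its fields.
# decode_serror_esr is a hypothetical helper, not a kernel or library API.

EC_SERROR = 0x2F  # exception class "SError interrupt"

# AET encodings (RAS extension); only valid when DFSC == 0x11 and IDS == 0.
AET_NAMES = {
    0b000: "uncontainable (UC)",
    0b001: "unrecoverable state (UEU)",
    0b010: "restartable state (UEO)",
    0b011: "recoverable state (UER)",
    0b110: "corrected (CE)",
}


def decode_serror_esr(esr):
    """Return the ESR_ELx fields of `esr` as a dict."""
    ec = (esr >> 26) & 0x3F      # bits [31:26]: exception class
    il = (esr >> 25) & 0x1       # bit 25: instruction length
    iss = esr & 0x1FFFFFF        # bits [24:0]: instruction specific syndrome
    fields = {"EC": ec, "IL": il, "ISS": iss, "is_serror": ec == EC_SERROR}
    if ec == EC_SERROR:
        fields["IDS"] = (iss >> 24) & 0x1   # implementation defined syndrome
        fields["IESB"] = (iss >> 13) & 0x1  # implicit error synchronization
        fields["AET"] = (iss >> 10) & 0x7   # asynchronous error type
        fields["EA"] = (iss >> 9) & 0x1     # external abort type
        fields["DFSC"] = iss & 0x3F         # 0x11: asynchronous SError
        fields["AET_name"] = AET_NAMES.get(fields["AET"], "reserved")
    return fields


if __name__ == "__main__":
    for name, value in decode_serror_esr(0xBE000011).items():
        print(name, hex(value) if isinstance(value, int) else value)
```

For 0xbe000011 this gives EC = 0x2f (SError), IL = 1 and ISS = 0x11, i.e. DFSC = 0x11 (asynchronous SError interrupt) with IDS, AET and EA all zero. So the code by itself seems to say little more than "an asynchronous external error was reported"; it does not identify which access or which device triggered it.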
Thanks,
Christian.
> CPU: 11 UID: 0 PID: 255 Comm: (udev-worker) Tainted: G S 6.15.0-rc2+ #1 VOLUNTARY
> Tainted: [S]=CPU_OUT_OF_SPEC
> Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 1.0 Jan 1 1980
> pstate: 83400009 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> pc : amdgpu_device_rreg+0x60/0xe4 [amdgpu]
> lr : hdp_v5_0_flush_hdp+0x6c/0x80 [amdgpu]
> sp : ffffffc08321b490
> x29: ffffffc08321b490 x28: ffffff80b8b80000 x27: ffffff80b8bd0178
> x26: ffffff80b8b8fe88 x25: 0000000000000001 x24: ffffff8081647000
> x23: ffffffc079d6e000 x22: ffffff80b8bd5000 x21: 000000000007f000
> x20: 000000000001fc00 x19: 00000000ffffffff x18: 00000000000015fc
> x17: 00000000000015fc x16: 00000000000015cf x15: 00000000000015ce
> x14: 00000000000015d0 x13: 00000000000015d1 x12: 00000000000015d2
> x11: 00000000000015d3 x10: 000000000000ec00 x9 : 00000000000015fd
> x8 : 00000000000015fd x7 : 0000000000001689 x6 : 0000000000555401
> x5 : 0000000000000001 x4 : 0000000000100000 x3 : 0000000000100000
> x2 : 0000000000000000 x1 : 000000000007f000 x0 : 0000000000000000
> Kernel panic - not syncing: Asynchronous SError Interrupt
> CPU: 11 UID: 0 PID: 255 Comm: (udev-worker) Tainted: G S 6.15.0-rc2+ #1 VOLUNTARY
> Tainted: [S]=CPU_OUT_OF_SPEC
> Hardware name: Radxa Computer (Shenzhen) Co., Ltd. Radxa Orion O6/Radxa Orion O6, BIOS 1.0 Jan 1 1980
> Call trace:
> show_stack+0x2c/0x84 (C)
> dump_stack_lvl+0x60/0x80
> dump_stack+0x18/0x24
> panic+0x148/0x330
> add_taint+0x0/0xbc
> arm64_serror_panic+0x64/0x7c
> do_serror+0x28/0x68
> el1h_64_error_handler+0x30/0x48
> el1h_64_error+0x6c/0x70
> amdgpu_device_rreg+0x60/0xe4 [amdgpu] (P)
> hdp_v5_0_flush_hdp+0x6c/0x80 [amdgpu]
> gmc_v10_0_hw_init+0xec/0x1fc [amdgpu]
> amdgpu_device_init+0x19f8/0x2480 [amdgpu]
> amdgpu_driver_load_kms+0x20/0xb0 [amdgpu]
> amdgpu_pci_probe+0x1b8/0x5d4 [amdgpu]
> pci_device_probe+0xbc/0x1a8
> really_probe+0xc0/0x39c
> __driver_probe_device+0x7c/0x14c
> driver_probe_device+0x3c/0x120
> __driver_attach+0xc4/0x200
> bus_for_each_dev+0x68/0xb4
> driver_attach+0x24/0x30
> bus_add_driver+0x110/0x240
> driver_register+0x68/0x124
> __pci_register_driver+0x44/0x50
> amdgpu_init+0x84/0xf94 [amdgpu]
> do_one_initcall+0x60/0x1e0
> do_init_module+0x54/0x200
> load_module+0x18f8/0x1e68
> init_module_from_file+0x74/0xa0
> __arm64_sys_finit_module+0x1e0/0x3f0
> invoke_syscall+0x64/0xe4
> el0_svc_common.constprop.0+0x40/0xe0
> do_el0_svc+0x1c/0x28
> el0_svc+0x34/0xd0
> el0t_64_sync_handler+0x10c/0x138
> el0t_64_sync+0x198/0x19c
> SMP: stopping secondary CPUs
> Kernel Offset: disabled
> CPU features: 0x1000,000000e0,f169a650,9b7ff667
> Memory Limit: none
> ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
>
> (bios version seems to be 45 years old but that is the state of the board
> when I received it)
>
> Also saw this crash with RX6700. Old radeons like HD5450 and nvidia gt1030
> work fine on that board.
>
> A little bit of testing showed that it was introduced between 6.12 and 6.13.
> Also it seems that changes were taken by some distro kernels already and
> different iso images I tried failed to boot before I bumped into some iso
> with kernel 6.8 that worked just fine.
>
> The only change related to hdp_v5_0_flush_hdp() was
> cf424020e040 drm/amdgpu/hdp5.0: do a posting read when flushing HDP
>
> Reverting that commit ^^ did help and resolved that problem. Before sending
> the revert as-is I wanted to know if there is supposed to be a proper fix
> for this, or whether someone is interested in debugging this or has any suggestions.
>
> In theory I also need to confirm that exactly that change introduced the
> regression.
>
> Thanks,
> Alexey
>