[Question] Call trace occurs occasionally when a rollback is performed upon CPU online timeout

Wed Jan 15 04:32:37 PST 2025

Hi all,

I have a question about CPU online/offline. In the following test
scenario, various workloads (iperf, fio, sve, ...) run in a VM with 6
vCPUs. At the same time, two of the vCPUs are repeatedly taken offline
and brought back online through /sys/devices/system/cpu/cpuX/online.
After many hours, call traces appear in the guest.
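
For reference, the hotplug part of the scenario can be sketched roughly as
below. This is only a minimal illustration, not the exact test script: the
function names, the CPU numbers (4 and 5), and the iteration count are
assumptions, and CPU_SYSFS is made overridable only so the loop can be
tried against a scratch directory.

```shell
#!/bin/sh
# Hypothetical reproducer sketch: repeatedly offline/online selected vCPUs.
# CPU_SYSFS defaults to the real sysfs path; overriding it is an assumption
# added here for dry runs, not part of the original test.
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"

toggle_cpu() {
    # $1 = CPU number, $2 = 0 (offline) or 1 (online)
    if ! echo "$2" > "$CPU_SYSFS/cpu$1/online"; then
        echo "cpu$1: writing $2 to online failed" >&2
        return 1
    fi
}

hotplug_loop() {
    # $1 = number of iterations, remaining arguments = CPU numbers
    iterations="$1"
    shift
    i=0
    while [ "$i" -lt "$iterations" ]; do
        for cpu in "$@"; do
            toggle_cpu "$cpu" 0 || return 1
            toggle_cpu "$cpu" 1 || return 1
        done
        i=$((i + 1))
    done
}

# Example (as root inside the guest, with the workloads already running):
#   hotplug_loop 100000 4 5
```

The loop stops at the first failed write, so a failed online attempt is
reported instead of being silently retried.
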
In the first, WARN_ON_ONCE(test_bit(KTHREAD_SHOULD_PARK, &kthread->flags))
in kthread_park() is triggered:
> Call trace:
> kthread_park+0xd0/0xdc
> takedown_cpu+0x4c/0x140
> cpuhp_invoke_callback+0x160/0x6e0
> _cpu_up+0x1a4/0x200
> cpu_up+0xbc/0x100
> cpu_device_up+0x20/0x30
> cpu_subsys_online+0x4c/0xb0
> device_online+0x7c/0xa0
> online_store+0xd0/0xe0
> dev_attr_store+0x20/0x34
> sysfs_kf_write+0x4c/0x5c
> kernfs_fop_write_iter+0x130/0x1c0
> new_sync_write+0xec/0x18c
> vfs_write+0x214/0x2ac
> ksys_write+0x70/0xfc
> __arm64_sys_write+0x24/0x30
> invoke_syscall+0x50/0x11c
> el0_svc_common.constprop.0+0x68/0x164
> do_el0_svc+0x34/0xcc
> el0_svc+0x20/0x30
> el0_sync_handler+0xb8/0xc0
> el0_sync+0x160/0x180

In the second, BUG_ON(!irqs_disabled() && !IS_ENABLED(CONFIG_PREEMPT_RT))
in irq_work_run_list() is triggered:
> Call trace:
> irq_work_run_list+0x64/0x70
> smpcfd_dying_cpu+0x24/0x34
> cpuhp_invoke_callback+0x160/0x6e0
> _cpu_up+0x1a4/0x200
> cpu_up+0xbc/0x100
> cpu_device_up+0x20/0x30
> cpu_subsys_online+0x4c/0xb0
> device_online+0x7c/0xa0
> online_store+0xd0/0xe0
> dev_attr_store+0x20/0x34
> sysfs_kf_write+0x4c/0x5c
> kernfs_fop_write_iter+0x130/0x1c0
> new_sync_write+0xec/0x18c
> vfs_write+0x214/0x2ac
> ksys_write+0x70/0xfc
> __arm64_sys_write+0x24/0x30
> invoke_syscall+0x50/0x11c
> el0_svc_common.constprop.0+0x68/0x164
> do_el0_svc+0x34/0xcc
> el0_svc+0x20/0x30
> el0_sync_handler+0xb8/0xc0
> el0_sync+0x160/0x180

According to my analysis, the root cause is that the vCPU online
operation times out even though the vCPU has in fact come online
successfully. The timeout triggers a rollback, and
secondary_start_kernel is still executing on that vCPU while the
rollback runs, which results in the call traces above. Is this a bug?
If so, how should it be fixed?

I have not yet found the reason for the timeout. I suspect it is
caused by heavy task load. If you have other ideas, please let me
know.

Thanks,
Kunkun Jiang