[PATCH 0/2] lib: sbi: Prevent system hang when bringing up secondary harts

Sun Feb 19 14:20:29 PST 2023

The boot hart can race with the secondary harts when it tries to bring them
up at startup time. This results in occasional system hangs at boot, which
can be reproduced in QEMU with a Linux guest, even with '-machine virt'.

When the Linux kernel needs to bring up the non-boot harts, the boot hart
enters opensbi via a ecall and eventually gets to sbi_hsm_hart_start().
Other (secondary) harts are spinning in the while loop in
sbi_hsm_hart_wait() at that moment:

	/* Wait for state transition requested by sbi_hsm_hart_start() */
	while (atomic_read(&hdata->state) != SBI_HSM_STATE_START_PENDING) {
		wfi();
	};

If I understand it correctly, the idea is that sbi_hsm_hart_start() should
first set the secondary hart's state to SBI_HSM_STATE_START_PENDING
atomically and check the previous value of the state. This makes sure no
other hart would try to modify the secondary hart's scratch area in the
meantime.

The boot hart then reads the init count of the secondary hart and sets the
address it should eventually jump to, as well as other data:

	init_count = sbi_init_count(hartid);
	rscratch->next_arg1 = arg1;
	rscratch->next_addr = saddr;
	rscratch->next_mode = smode;

Then, the boot hart sends IPI to the secondary hart, so that it would stop
spinning in the loop above and would continue.

However, as we observed during the test runs for our systems, the secondary
harts could leave the loop in sbi_hsm_hart_wait() as soon as their state is
set to SBI_HSM_STATE_START_PENDING. As a result, the following situation is
possible:

Boot hart (#0)                        Secondary hart (#1)
runs Linux startup code               waits in sbi_hsm_hart_wait()
sbi_ecall(SBI_EXT_HSM,
          SBI_EXT_HSM_HART_START,
          ...)
enters sbi_hsm_hart_start()
sets state to START_PENDING
                                      leaves sbi_hsm_hart_wait()
                                      runs to the end of init_warmboot()
                                      returns to scratch->next_addr
                                      (next_addr can be garbage here)

sets next_addr, etc. for hart #1
(no good: hart #1 has already left)

sends IPI to hart #1
(no good either)
                   (system might hang or crash)

This can be reproduced easier, if one inserts a delay into
sbi_hsm_hart_start() right before "init_count = sbi_init_count(hartid);".
In our case, "sbi_timer_mdelay(10)" was enough for the issue to manifest
each time.

The patch series fixes 2 issues that could make the secondary harts exit
the while loop in sbi_hsm_hart_wait() prematurely, i.e. before the IPI:

1. "wfi" instruction is just a hint for the CPU, it might or might not
actually wait for an interrupt to happen in the hart. RISC-V Privileged
Architectures spec. v20211203 even states that "a legal implementation is to
simply implement WFI as a NOP".

2. Since commit 50d4fde1c5a4 ("lib: Remove redundant sbi_platform_ipi_clear()
calls"), the IPI sent from the boot hart in wake_coldboot_harts() is never
cleared in the secondary harts. So, they enter sbi_hsm_hart_wait() with
already pending IPIs and do not wait for new ones.

In both cases, the secondary hart might leave the loop as soon as its state
is set to SBI_HSM_STATE_START_PENDING, leading to the race described above.

Regards,
Evgenii