[PATCH 1/2] lib: sbi: Actually wait for an IPI in sbi_hsm_hart_wait()

Mon Feb 20 00:28:49 PST 2023

Hi,

On 20.02.2023 01:38, Jessica Clarke wrote:
> On 19 Feb 2023, at 22:20, Evgenii Shatokhin <e.shatokhin at yadro.com> wrote:
>>
>> "wfi" instruction is not guaranteed to wait for an interrupt to be received
>> by the hart, it is just a hint for the CPU. According to RISC-V Privileged
>> Architectures spec. v20211203, even an implementation of "wfi" as "nop" is
>> legal.
>>
>> As a result, a secondary (non-boot) hart waiting in the while loop in
>> sbi_hsm_hart_wait() might leave the loop and continue even if it got no
>> IPI. This could lead to the following race condition when booting Linux,
>> for example:
>>
>>   Boot hart (#0)                        Secondary hart (#1)
>>   runs Linux startup code               waits in sbi_hsm_hart_wait()
>>
>>   sbi_ecall(SBI_EXT_HSM,
>>             SBI_EXT_HSM_HART_START,
>>             ...)
>>   enters sbi_hsm_hart_start()
>>   sets state of hart #1 to START_PENDING
>>                                         leaves sbi_hsm_hart_wait()
>>                                         runs to the end of init_warmboot()
>>                                         returns to scratch->next_addr
>>                                         (next_addr can be garbage here)
>>
>>   sets next_addr, etc. for hart #1
>>   (no good: hart #1 has already left)
>>
>>   sends IPI to hart #1
>>   (no good either)
>>
>> If this happens, the secondary hart jumps to a wrong next_addr at the end
>> of init_warmboot(), which leads to a system hang or crash.
>>
>> To reproduce the issue more reliably, one could add a delay in
>> sbi_hsm_hart_start(): after setting the hart's state but before sending
>> IPI to that hart:
>>
>>     hstate = atomic_cmpxchg(&hdata->state, SBI_HSM_STATE_STOPPED,
>>                             SBI_HSM_STATE_START_PENDING);
>>     ...
>>   + sbi_timer_mdelay(10);
>>     init_count = sbi_init_count(hartid);
>>     rscratch->next_arg1 = arg1;
>>     rscratch->next_addr = saddr;
>>
>> The issue can be reproduced, for example, in a QEMU VM with '-machine virt'
>> and 2 or more CPUs, with Linux as the guest OS.
>>
>> Let us actually check in sbi_hsm_hart_wait(), if the waiting hart has
>> received an IPI.
> 
> Using the IPI itself as the synchronisation barrier seems awfully
> brittle. Better to split SBI_HSM_STATE_START_PENDING into two states so
> there’s an additional atomic_write that the atomic_read here can
> synchronise with.

Yes, using an additional state was the first idea I tried to implement 
to fix the issue. It turned out to be too error-prone, however: 
SBI_HSM_STATE_START_PENDING is used in a number of places in opensbi, 
and it is far from obvious how to use the new state correctly there, 
esp. without violation of the spec.

On the other hand, wait_for_coldboot() and wake_coldboot_harts() already 
use IPI and a condition variable 'coldboot_done' for synchronization, 
and it seems to work OK.

So, I suppose, relying on IPI in sbi_hsm_hart_wait() and 
sbi_hsm_hart_start() too might be a way to go. Looks like, it was the 
intended way:

* Set the state to START_PENDING and get exclusive access to the target 
hart's scratch area.
* Write next_addr, next_arg1, etc., for the target hart.
* Send IPI to the target hart to confirm that it might continue.

WFI might or might not wait for an interrupt in the target hart, so the 
boot hart must send an IPI to the target anyway to wake it reliably.

Or, perhaps, I am missing some particular issue with this approach?

> 
> Jess
> 
>> Signed-off-by: Evgenii Shatokhin <e.shatokhin at yadro.com>
>> ---
>> lib/sbi/sbi_hsm.c | 8 ++++++--
>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/lib/sbi/sbi_hsm.c b/lib/sbi/sbi_hsm.c
>> index 8499bb1..0e5ff3d 100644
>> --- a/lib/sbi/sbi_hsm.c
>> +++ b/lib/sbi/sbi_hsm.c
>> @@ -108,6 +108,8 @@ void sbi_hsm_prepare_next_jump(struct sbi_scratch *scratch, u32 hartid)
>> static void sbi_hsm_hart_wait(struct sbi_scratch *scratch, u32 hartid)
>> {
>>        unsigned long saved_mie;
>> +     unsigned long ipi_received;
>> +     unsigned int hstate;
>>        struct sbi_hsm_data *hdata = sbi_scratch_offset_ptr(scratch,
>>                                                            hart_data_offset);
>>        /* Save MIE CSR */
>> @@ -117,9 +119,11 @@ static void sbi_hsm_hart_wait(struct sbi_scratch *scratch, u32 hartid)
>>        csr_set(CSR_MIE, MIP_MSIP | MIP_MEIP);
>>
>>        /* Wait for state transition requested by sbi_hsm_hart_start() */
>> -     while (atomic_read(&hdata->state) != SBI_HSM_STATE_START_PENDING) {
>> +     do {
>>                wfi();
>> -     };
>> +             ipi_received = csr_read(CSR_MIP) & (MIP_MSIP | MIP_MEIP);
>> +             hstate = atomic_read(&hdata->state);
>> +     } while (hstate != SBI_HSM_STATE_START_PENDING || !ipi_received);
>>
>>        /* Restore MIE CSR */
>>        csr_write(CSR_MIE, saved_mie);
>> --
>> 2.34.1

Regards,
Evgenii