qemu riscv, thead c906, Linux boot regression

Wed Jan 24 05:27:10 PST 2024

Conor Dooley <conor at kernel.org> writes:

> On Wed, Jan 24, 2024 at 01:49:51PM +0100, Björn Töpel wrote:
>> Hi!
>> 
>> I bumped the RISC-V Linux kernel CI to use qemu 8.2.0, and realized that
>> thead c906 didn't boot anymore. Bisection points to commit d6a427e2c0b2
>> ("target/riscv/cpu.c: restrict 'marchid' value")
>> 
>> Reverting that commit, or applying the hack below, solves the boot issue:
>> 
>> --8<--
>> diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
>> index 8cbfc7e781ad..e18596c8a55a 100644
>> --- a/target/riscv/cpu.c
>> +++ b/target/riscv/cpu.c
>> @@ -505,6 +505,9 @@ static void rv64_thead_c906_cpu_init(Object *obj)
>>      cpu->cfg.ext_xtheadsync = true;
>>  
>>      cpu->cfg.mvendorid = THEAD_VENDOR_ID;
>> +    cpu->cfg.marchid = ((QEMU_VERSION_MAJOR << 16) |
>> +                        (QEMU_VERSION_MINOR << 8)  |
>> +                        (QEMU_VERSION_MICRO));
>>  #ifndef CONFIG_USER_ONLY
>>      set_satp_mode_max_supported(cpu, VM_1_10_SV39);
>>  #endif
>> --8<--
>> 
>> I'm unsure what the correct qemu way of adding a default value is,
>> or if c906 should have a proper marchid.
>
> The "correct" marchid/mimpid values for the c906 are zero.

Ok! Thanks for clearing that up for me.

> I haven't looked into the code at all, so I am "assuming" that it is
> being zero initialised at present. Linux applies the errata fixups for
> the c906 when archid and impid are both zero - so your patch will avoid
> these fixups being applied.

I'm also assuming 0, but will double-check. Hmm, that means that the
*previous* marchid (before d6a427e2c0b2) was incorrect.
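
To spell out the reasoning, here is a minimal sketch of why the hack above
changes what the kernel does. The helper names are made up for illustration
and come from neither QEMU nor Linux; the packing mirrors the diff, and the
"both zero" condition is the one Conor describes, not the kernel's actual
code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same packing as the hack in the diff: major << 16 | minor << 8 | micro.
 * Hypothetical helper, not a QEMU function. */
static uint64_t qemu_style_marchid(unsigned major, unsigned minor,
                                   unsigned micro)
{
    /* minor and micro each get one byte in this layout */
    assert(minor < 256 && micro < 256);
    return ((uint64_t)major << 16) | ((uint64_t)minor << 8) | micro;
}

/* Stand-in for the condition described in this thread: the c906 errata
 * fixups are applied only when archid and impid are both zero.
 * Hypothetical helper, not the kernel's implementation. */
static bool c906_errata_would_apply(uint64_t archid, uint64_t impid)
{
    return archid == 0 && impid == 0;
}
```

With qemu 8.2.0 the packed value is 0x080200, which is non-zero, so the
check fails and the fixups are skipped. With marchid left at 0 (and impid
at 0) the fixups are applied.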

> Do you think that perhaps the emulation in QEMU does not support what
> the kernel uses once the errata fixups are enabled?

I took a quick look at the c906 "in_asm,int" logs:

| 0x80201040:  12000073          sfence.vma              zero,zero
| 0x80201044:  18051073          csrrw                   zero,satp,a0
| 
| riscv_cpu_do_interrupt: hart:0, async:0, cause:000000000000000c, epc:0x0000000080201048, tval:0x0000000080201048, desc=exec_page_fault
| riscv_cpu_do_interrupt: hart:0, async:0, cause:000000000000000c, epc:0xffffffff80001048, tval:0xffffffff80001048, desc=exec_page_fault
| ... (continues forever)

So it looks like we're tripping over the page tables as soon as paging is
turned on.
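
For reference, a small sketch of how I read those two log fields. Cause
0xc (12) with async:0 is the instruction page fault exception in the
RISC-V privileged spec, and the first epc is the instruction immediately
after the satp write. The helper names are hypothetical and only
illustrate the decoding; they are not taken from QEMU:

```c
#include <stdbool.h>
#include <stdint.h>

/* Exception code 12 in the RISC-V privileged spec. */
enum { RISCV_EXC_INST_PAGE_FAULT = 12 };

/* csrrw is an uncompressed, 4-byte instruction (encoding 18051073). */
enum { RISCV_INSN_LEN = 4 };

/* True for a synchronous trap whose cause is an instruction page fault.
 * 'async' and 'cause' correspond to the fields in the log line. */
static bool is_inst_page_fault(bool async, uint64_t cause)
{
    return !async && cause == RISCV_EXC_INST_PAGE_FAULT;
}

/* True if the trapping pc is the instruction directly following insn_pc. */
static bool faults_right_after(uint64_t insn_pc, uint64_t epc)
{
    return epc == insn_pc + RISCV_INSN_LEN;
}
```

Plugging in the values from the log: cause 0xc with async:0 is an
instruction page fault, and epc 0x80201048 directly follows the csrrw to
satp at 0x80201044.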

Hmm, maybe it's not qemu, but the c906 that has been broken for a while?

I'll temporarily disable it in CI anyhow, and will continue digging.

Thanks for the pointers/clarifications, Conor!
Björn