[PATCH] target/riscv: Exit current TB after an sfence.vma

Palmer Dabbelt palmer at dabbelt.com
Wed Mar 30 09:11:18 PDT 2022


[re-ordering the top post]

+linux-riscv, as this may very well be a kernel bug

On Wed, 30 Mar 2022 05:38:30 PDT (-0700), phantom at zju.edu.cn wrote:
>> -----Original Messages-----
>> From: "Idan Horowitz" <idan.horowitz at gmail.com>
>> Sent Time: 2022-03-30 15:35:19 (Wednesday)
>> To: "Atish Patra" <atishp at atishpatra.org>
>> Cc: phantom at zju.edu.cn, "open list:RISC-V" <qemu-riscv at nongnu.org>, "Alistair Francis" <Alistair.Francis at wdc.com>, "qemu-devel at nongnu.org Developers" <qemu-devel at nongnu.org>
>> Subject: Re: [PATCH] target/riscv: Exit current TB after an sfence.vma
>> 
>> On Wed, 30 Mar 2022 at 10:28, Atish Patra <atishp at atishpatra.org> wrote:
>> >
>> > I tested on v5.17 built from defconfig for rv64.
>> >
>> > Here is the kernel code executing sfence.vma
>> > https://elixir.bootlin.com/linux/v5.17/source/arch/riscv/kernel/head.S#L122
>> >
>> 
>> I believe this is a kernel bug and not a QEMU one. They perform a
>> write to the SATP with the same ASID as the one used before (0) and

I seem to remember ASID 0 being a special one, with some global-ness, 
but I couldn't find that in a quick poke into the spec.  As below, I'm 
going to read through this a bit more...

(Also, I haven't had any coffee yet)

>> then expect it to be used, without performing an sfence.vma following
>> it.
>> This was exposed by my change, as previously the write to the satp was
>> performed in the same TB block as the sfence.vma *before it*, which
>> meant the TLB was not filled in between the previous sfence and the
>> write to SATP following it.
>> I was able to reproduce the issue with the Fedora Rawhide image in the
>> wiki, and I was able to resolve it by artificially forcing a TLB flush
>> following all writes to SATP.
>> I think the correct course of action here is to:
>> 1. Report the issue to the linux kernel mailing list and/or contribute
>> a patch that adds said missing sfence.vma following the SATP write.
>> (Atish: Are you able to test if adding an sfence.vma in your kernel
>> build fixes the issue?)

If it's a kernel bug we should fix it, but I'm not entirely convinced 
that's the case.  I can confirm that the following makes Linux boot 
again

diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index ec07f991866a..83373c2bd7d0 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -121,6 +121,7 @@ relocate:
        or a0, a0, a1
        sfence.vma
        csrw CSR_SATP, a0
+       sfence.vma
 .align 2
 1:
        /* Set trap vector to spin forever to help debug */

It's been a while since I read through the rules here so I'm going to go 
read them again, but IIRC that shouldn't be necessary: that first 
sfence.vma should be sufficiently global to ensure all prior writes to 
the page tables are visible, regardless of what's in SATP.  That said, I 
remember there being a lot of subtly here in the spec wording so I'm 
going to go read the spec again.

>> 2. Restore the patch

Presumably you mean "revert" here?  That might be the right way to go, 
just to avoid breaking users (even if we fix the kernel bug, it'll take 
a while to get everyone to update).  That said, this smells like the 
sort of thing that's going to crop up at arbitrary times in dynamic 
systems so while a revert looks like it'd work around the boot issue we 
might be making more headaches for folks down the road.

> I agree with you partly, my test case is actually from linux kernel, I notice 
> the strange sfence.vma before write satp during write our teaching kernel.
> I think, the strange code is used to bypass the qemu bug that Idan patched.
> Because in hardware, if the stap is empty, sfence.vma will do nothing.
> And that's why nobody report it.

IIUC the sfence.vma before the SATP write is very explicitly necessary: 
without that fence old mappings could be utilized directly after the 
SATP write, so we might not even be able to fetch the next instruction.

> Before patch, qemu won't end a BB after sfence (but jump and CSR write do).
> So, the kernel author reodered write stap and sfence.vma to make sfence.vma 
> place in the same BB with write satp, instead of the following write stvec.
> (If don't reorder, sfence.vma will place in the same BB with write stvec,
> that will crash kernel, see my origin analysis).
>
> However, in hardware, since tlb is empty, put the first sfence.vma before or 
> after write satp is not really matters.
> In qemu, as Atish's log shown, we should do a imply invisible sfence.vma after
> write stap to invalid qemu's translation cache.

The ISA manual is quite explicit about SATP not enforcing these 
orderings.  If what I remember about ASID 0 is true then I do think we'd 
need one, though, to avoid the aliasing -- hopefully I'll make a bit 
more sense soon, though...



More information about the linux-riscv mailing list