[PATCH] target/riscv: Exit current TB after an sfence.vma

Wed Mar 30 10:06:38 PDT 2022

On Wed, 30 Mar 2022 09:11:18 PDT (-0700), Palmer Dabbelt wrote:
> [re-ordering the top post]
>
> +linux-riscv, as this may very well be a kernel bug
>
> On Wed, 30 Mar 2022 05:38:30 PDT (-0700), phantom at zju.edu.cn wrote:
>>> -----Original Messages-----
>>> From: "Idan Horowitz" <idan.horowitz at gmail.com>
>>> Sent Time: 2022-03-30 15:35:19 (Wednesday)
>>> To: "Atish Patra" <atishp at atishpatra.org>
>>> Cc: phantom at zju.edu.cn, "open list:RISC-V" <qemu-riscv at nongnu.org>, "Alistair Francis" <Alistair.Francis at wdc.com>, "qemu-devel at nongnu.org Developers" <qemu-devel at nongnu.org>
>>> Subject: Re: [PATCH] target/riscv: Exit current TB after an sfence.vma
>>>
>>> On Wed, 30 Mar 2022 at 10:28, Atish Patra <atishp at atishpatra.org> wrote:
>>> >
>>> > I tested on v5.17 built from defconfig for rv64.
>>> >
>>> > Here is the kernel code executing sfence.vma
>>> > https://elixir.bootlin.com/linux/v5.17/source/arch/riscv/kernel/head.S#L122
>>> >
>>>
>>> I believe this is a kernel bug and not a QEMU one. They perform a
>>> write to the SATP with the same ASID as the one used before (0) and
>
> I seem to remember ASID 0 being a special one, with some global-ness,
> but I couldn't find that in a quick poke into the spec.  As below, I'm
> going to read through this a bit more...
>
> (Also, I haven't had any coffee yet)
>
>>> then expect it to be used, without performing an sfence.vma following
>>> it.
>>> This was exposed by my change, as previously the write to the satp was
>>> performed in the same TB block as the sfence.vma *before it*, which
>>> meant the TLB was not filled in between the previous sfence and the
>>> write to SATP following it.
>>> I was able to reproduce the issue with the Fedora Rawhide image in the
>>> wiki, and I was able to resolve it by artificially forcing a TLB flush
>>> following all writes to SATP.
>>> I think the correct course of action here is to:
>>> 1. Report the issue to the linux kernel mailing list and/or contribute
>>> a patch that adds said missing sfence.vma following the SATP write.
>>> (Atish: Are you able to test if adding an sfence.vma in your kernel
>>> build fixes the issue?)
>
> If it's a kernel bug we should fix it, but I'm not entirely convinced
> that's the case.  I can confirm that the following makes Linux boot
> again
>
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index ec07f991866a..83373c2bd7d0 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -121,6 +121,7 @@ relocate:
>         or a0, a0, a1
>         sfence.vma
>         csrw CSR_SATP, a0
> +       sfence.vma
>  .align 2
>  1:
>         /* Set trap vector to spin forever to help debug */
>
> It's been a while since I read through the rules here so I'm going to go
> read them again, but IIRC that shouldn't be necessary: that first
> sfence.vma should be sufficiently global to ensure all prior writes to
> the page tables are visible, regardless of what's in SATP.  That said, I
> remember there being a lot of subtly here in the spec wording so I'm
> going to go read the spec again.
>
>>> 2. Restore the patch
>
> Presumably you mean "revert" here?  That might be the right way to go,
> just to avoid breaking users (even if we fix the kernel bug, it'll take
> a while to get everyone to update).  That said, this smells like the
> sort of thing that's going to crop up at arbitrary times in dynamic
> systems so while a revert looks like it'd work around the boot issue we
> might be making more headaches for folks down the road.
>
>> I agree with you partly, my test case is actually from linux kernel, I notice
>> the strange sfence.vma before write satp during write our teaching kernel.
>> I think, the strange code is used to bypass the qemu bug that Idan patched.
>> Because in hardware, if the stap is empty, sfence.vma will do nothing.
>> And that's why nobody report it.
>
> IIUC the sfence.vma before the SATP write is very explicitly necessary:
> without that fence old mappings could be utilized directly after the
> SATP write, so we might not even be able to fetch the next instruction.
>
>> Before patch, qemu won't end a BB after sfence (but jump and CSR write do).
>> So, the kernel author reodered write stap and sfence.vma to make sfence.vma
>> place in the same BB with write satp, instead of the following write stvec.
>> (If don't reorder, sfence.vma will place in the same BB with write stvec,
>> that will crash kernel, see my origin analysis).
>>
>> However, in hardware, since tlb is empty, put the first sfence.vma before or
>> after write satp is not really matters.
>> In qemu, as Atish's log shown, we should do a imply invisible sfence.vma after
>> write stap to invalid qemu's translation cache.
>
> The ISA manual is quite explicit about SATP not enforcing these
> orderings.  If what I remember about ASID 0 is true then I do think we'd
> need one, though, to avoid the aliasing -- hopefully I'll make a bit
> more sense soon, though...

OK, I must have just been crazy -- there's nothing special about ASID=0.  
Looks to me like flushing the TLB on SATP writes is necessary, I just 
sent a patch with a more concrete description of why.

Thanks!