Turning the MMU on .....
Paul Campbell
taniwha at gmail.com
Sun Nov 29 19:12:48 EST 2020
I'm new to this list and likely ignorant of past discussions, so please bear with me :-)
I'm bringing up a new core, one that's heavily pipelined/speculative/out-of-order, all
that good stuff, and I've reached the point where Linux is coming up. Most of my
issues are mine, but there's one I've come across in the tip-of-tree riscv
Linux that I think is more general .... it's this code in relocate() in head.S.
To paraphrase, the code is:
relocate:
	li a1, PAGE_OFFSET
	la a2, _start
	sub a1, a1, a2		// a1 is the relocation offset
	....
	la a2, 1f
	add a2, a2, a1
	csrw CSR_TVEC, a2	// trap vector is 1f, relocated
	....			// (the elided code leaves SATP_MODE in a1)
	la a0, trampoline_pg_dir
	srl a0, a0, PAGE_SHIFT
	or a0, a0, a1		// a0 is now the new satp value
	sfence.vma
	csrw CSR_SATP, a0	// the MMU turns on here
.align 2
1: ....
In my world this fails miserably, mostly because the sfence.vma does a pipe
flush (as it should), and by the time the csrw CSR_SATP, a0 is executed the
front end has already fetched (using the old [turned-off] MMU mapping) and
speculatively executed much of the code up to the following return instruction.
What I think the code is expecting is that the instruction following the write
to CSR_SATP will fault and the instruction stream will be refetched using the
new mapping. That likely works on some microarchitectures; it also probably
works by happenstance on some systems where an invalid instruction happens to
be hiding under the ".align 2".
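To make the window concrete, here is the tail of that sequence again, annotated
with what a deep out-of-order front end can do (a sketch - the depth of the
window is implementation-specific):

	sfence.vma		// pipe flush; fetch resumes using the old
				// (MMU-off) translation regime
	csrw CSR_SATP, a0	// the MMU is on from here, but everything below
				// may already have been fetched - and
				// speculatively executed - using the old mapping
.align 2
1:	....			// no fault, and no refetch, is guaranteed here
	ret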
Reading the RISC-V privileged spec, it's very explicit about "csrw CSR_SATP,
a0":
"Note that writing satp does not imply any ordering constraints between page-
table updates and subsequent address translations. If the new address space’s
page tables have been modified, or if an ASID is reused, it may be necessary to
execute an SFENCE.VMA instruction (see Section 4.2.1) after writing satp."
4.2.1 includes the note:
"A consequence of this specification is that an implementation may use any
translation for an address that was valid at any time since the most recent
SFENCE.VMA that subsumes that address. In particular, if a leaf PTE is modified
but a subsuming SFENCE.VMA is not executed, either the old translation or the
new translation will be used, but the choice is unpredictable. The behavior is
otherwise well-defined."
What does this mean? It means that if you SFENCE.VMA and then subsequently
write to satp, it is undefined whether the new page-table regime is in place
for an arbitrary number of instructions thereafter. That number can be quite
large when you are turning on the MMU for the first time, because some larger
systems may have hundreds of decoded instructions in flight at a time - in some
versions of my current system it can be ~100, though in this particular case
it's more likely on the order of 10-12 instructions that manage to pass the
instruction TLB between when the sfence is executed and when satp is written.
In general I think that for RISC-V MMU code to work we always need to sfence
after every write to satp or to the page tables (with, as the spec says, a
fence that subsumes the affected range) .... AND there needs to be a mapping in
place in the MMU configuration, both before and after the write to satp, that
validly maps the virtual addresses of the code between the write to satp and
the sfence instruction.
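Concretely, the switch sequence needs to look like the following (a minimal
sketch: it assumes a0 already holds the complete new satp value, and that these
instructions are mapped at the same virtual addresses before and after the
write):

	csrw CSR_SATP, a0	// install the new page-table regime
	sfence.vma		// then fence: discard any translation (and any
				// speculative fetch) made under the old regime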
This last requirement is normally not an issue in the Linux kernel, since all
the kernel code is covered by one big mapping that doesn't change .... except,
of course, when you first turn on the MMU, when you're switching from no MMU to
a running MMU - which is the situation where I started this discussion.
-------------------------------------------------------------------------------------------------------------------
So, a proposal: rather than use the 'trampoline' code, which only works on some
systems, we should use an initial kernel mapping that maps both the kernel
virtual addresses and the initial memory 1:1. If we do that, the actual initial
switch becomes simple (see the attached code fragment). The other required
change is in setup_vm() - instead of making a 'trampoline' mapping and an
initial kernel mapping, we just make an initial kernel mapping that also
contains a 1:1 mapping of the initially loaded kernel.
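For illustration, such a dual mapping can be as small as a single page. Here is
a hypothetical Sv39 page directory with two 1 GiB leaf entries; the name, the
0x80000000 load address, and the 0xffffffe000000000 PAGE_OFFSET are my
assumptions for the sake of the example:

	.section .data
	.align 12			// one page: 512 8-byte PTEs
dual_pg_dir:				// hypothetical name
	.zero 8 * 2
	.dword 0x200000cf		// entry 2: VA 0x80000000 -> PA 0x80000000,
					// 1 GiB leaf, V+R+W+X+A+D (the 1:1 mapping)
	.zero 8 * (384 - 3)
	.dword 0x200000cf		// entry 384: PAGE_OFFSET -> PA 0x80000000,
					// the same leaf (the kernel mapping)
	.zero 8 * (512 - 385)

Because both entries point at the same physical gigapage, the instructions
around the write to satp translate identically under the old and new regimes,
which is exactly the property the switch needs.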
Anyway, this has gone on too long - hopefully the right people will read it and
understand. As I mentioned above, I'm a noob here (but a kernel hacker since
V6, and I've been laying gates for almost as long).
Paul Campbell
Moonbase Otago
-------------- next part --------------
#ifdef CONFIG_MMU
relocate:
	/* Relocate the return address and the global pointer */
	li a1, PAGE_OFFSET
	la a2, _start
	sub a1, a1, a2		// a1 is the relocation offset
	add ra, ra, a1
	add gp, gp, a1
	/* Compute satp for the kernel page tables (the caller
	 * passes their physical address in a0) */
	srl a2, a0, PAGE_SHIFT
	li a1, SATP_MODE
	or a2, a2, a1
	/*
	 * Switch to the kernel page tables: the 1:1 mapping keeps this
	 * code fetchable across the write, and the fence then discards
	 * any translations left over from the old regime.
	 */
	csrw CSR_SATP, a2
	sfence.vma
	ret			// the switch to kernel addressing occurs here,
				// via the relocated return address
#endif /* CONFIG_MMU */