[PATCH RFC 00/17] arm64 kernel text replication

Mark Rutland mark.rutland at arm.com
Mon Jun 5 06:46:37 PDT 2023


On Mon, Jun 05, 2023 at 10:05:22AM +0100, Russell King (Oracle) wrote:
> Hi,
> 
> Are there any comments on this?

This is on my queue of things to review, but I haven't had the chance to give
more than a cursory look so far. I'm hoping to get to it in the next few days.

Thanks,
Mark.

> 
> Thanks.
> 
> On Tue, May 30, 2023 at 03:04:01PM +0100, Russell King (Oracle) wrote:
> > Problem
> > -------
> > 
> > NUMA systems have greater latency when accessing data and instructions
> > across nodes, which can lead to a reduction in performance on CPU cores
> > that mainly perform accesses beyond their local node.
> > 
> > Normally when an ARM64 system boots, the kernel will end up placed in
> > memory, and each CPU core will have to fetch instructions and data from
> > whichever NUMA node the kernel has been placed in. This means that while
> > executing kernel code, CPUs local to that node will run faster than
> > CPUs in remote nodes.
> > 
> > The higher the latency to access remote NUMA node memory, the more the
> > kernel performance suffers on those nodes.
> > 
> > If there is a local copy of the kernel text in each node's RAM, and
> > each node runs the kernel using its local copy of the kernel text,
> > then it stands to reason that the kernel will run faster due to fewer
> > stalls while instructions are fetched from remote memory.
> > 
> > The question then arises how to achieve this.
> > 
> > Background
> > ----------
> > 
> > An important issue to contend with is what happens when a thread
> > migrates between nodes. Essentially, the thread's state (including
> > instruction pointer) is saved to memory, and the scheduler on that CPU
> > loads some other thread's state and that CPU resumes executing that
> > new thread.
> > 
> > The CPU gaining the migrating thread loads the saved state, again
> > including the instruction pointer, and the gaining CPU resumes fetching
> > instructions at the virtual address where the original CPU left off.
> > 
> > The key point is that the virtual address is what matters here, and
> > this gives us a way to implement kernel text replication fairly easily.
> > At a practical level, all we need to do is to ensure that the virtual
> > addresses which contain the kernel text point to a local copy of
> > that text.
> > 
> > This is exactly how this proposal of kernel text replication achieves
> > the replication. We can go a little bit further and include most of
> > the read-only data in this replication, as that will never be written
> > to by the kernel (and thus remains constant.)
> > 
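To illustrate the point above, here is a toy model (not code from the series). If every node holds an identical copy of the text, and a saved program counter is interpreted as an offset into whichever copy is local, then a thread can be resumed on any node and will fetch the same instruction. All names below (`toy_ctx`, `toy_replicate`, `resume_fetch`, `NR_TOY_NODES`) are made up for the example, and the per-node arrays merely stand in for node-local RAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NR_TOY_NODES 2	/* hypothetical node count */
#define TOY_TEXT_LEN 4	/* toy "kernel text" length, in instructions */

/* One copy of the text per node, standing in for node-local RAM. */
static uint32_t toy_text[NR_TOY_NODES][TOY_TEXT_LEN];

/* Saved thread state: only the program counter matters here. */
struct toy_ctx {
	size_t pc;	/* "virtual" offset into the kernel text */
};

/* Copy the boot image's text into every node's local copy. */
static void toy_replicate(const uint32_t *boot_text)
{
	for (int nid = 0; nid < NR_TOY_NODES; nid++)
		memcpy(toy_text[nid], boot_text, sizeof(toy_text[nid]));
}

/* Resume a saved thread on node 'nid': fetch from that node's copy. */
static uint32_t resume_fetch(int nid, const struct toy_ctx *ctx)
{
	return toy_text[nid][ctx->pc];
}
```

The saved `pc` is never translated when the thread moves. Only the backing storage behind that offset differs per node, which is the property the proposal relies on.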
> > Solution
> > --------
> > 
> > So, what we need to achieve is:
> > 
> > 1. multiple identical copies of the kernel text (and read-only data)
> > 2. point the virtual mappings to the appropriate copy of kernel text
> >    for the NUMA node.
> > 
> > (1) is fairly easy to achieve - we just need to allocate some memory
> > in the appropriate node and copy the parts of the kernel we want to
> > replicate. However, we also need to deal with ARM64's kernel patching.
> > There are two functions that patch the kernel text,
> > __apply_alternatives() and aarch64_insn_patch_text_nosync(). Both of
> > these need to be modified to update all copies of the kernel text.
> > 
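The patching requirement in (1) can be sketched roughly as follows. This is only a user-space illustration of "write the same instruction into every replica". The helper names (`toy_patch_all_nodes`, `toy_replicas_identical`) are hypothetical and are not the functions touched by the series, and the cache maintenance that real instruction patching also needs is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_TOY_NODES 2	/* hypothetical node count */
#define TOY_TEXT_LEN 4	/* toy text length, in instructions */

/* Per-node replicas of the kernel text (toy stand-in). */
static uint32_t toy_ktext[NR_TOY_NODES][TOY_TEXT_LEN];

/* Patch one instruction in every replica; returns 0, or -1 if out of range. */
static int toy_patch_all_nodes(size_t idx, uint32_t insn)
{
	if (idx >= TOY_TEXT_LEN)
		return -1;

	for (int nid = 0; nid < NR_TOY_NODES; nid++)
		toy_ktext[nid][idx] = insn;

	return 0;
}

/* Check that every replica still matches node 0's copy. */
static int toy_replicas_identical(void)
{
	for (int nid = 1; nid < NR_TOY_NODES; nid++)
		for (size_t i = 0; i < TOY_TEXT_LEN; i++)
			if (toy_ktext[nid][i] != toy_ktext[0][i])
				return 0;
	return 1;
}
```

If a patch path updated only one copy, the replicas would diverge and a thread migrating between nodes could observe different code at the same virtual address.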
> > (2) is slightly harder.
> > 
> > Firstly, the aarch64 architecture has a very useful feature here - the
> > kernel page tables are entirely separate from the user page tables.
> > The hardware contains two page table pointers, one is used for user
> > mappings, the other is used for kernel mappings.
> > 
> > Therefore, we only have one page table to be concerned with: the table
> > which maps kernel space. We do not need to be concerned with each
> > user process's page table.
> > 
> > The approach taken here is to ensure that the kernel is located in an
> > area of kernel virtual address space covered by a level-0 page table
> > entry which is not shared with any other user. We can then maintain
> > separate per-node level-0 page tables for kernel space where the only
> > difference between them is this level-0 page table entry.
> > 
> > This gives a couple of benefits. Firstly, when updates to the level-0
> > page table happen (e.g. when establishing new mappings) these updates
> > can simply be copied to the other level-0 page tables provided they are
> > not for the kernel image. Secondly, we don't need complexity at lower
> > levels of the page table code to figure out whether a level-1 or lower
> > update needs to be propagated to other nodes.
> > 
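A minimal sketch of the level-0 propagation rule described above, assuming a toy table of eight entries with one slot reserved for the kernel image. The table size, the slot number and the function name `toy_set_pgd` are all assumptions made for the example, not values or interfaces from the series.

```c
#include <assert.h>
#include <stdint.h>

#define NR_TOY_NODES      2	/* hypothetical node count */
#define TOY_PGD_ENTRIES   8	/* toy size; the real value depends on the config */
#define TOY_KTEXT_PGD_IDX 5	/* hypothetical slot reserved for the kernel image */

/* One level-0 table per node. */
static uint64_t toy_pgd[NR_TOY_NODES][TOY_PGD_ENTRIES];

/*
 * Install a level-0 entry on behalf of node 'nid'.
 * Entries outside the kernel image slot are copied to every node's table;
 * the kernel image slot is node-unique and is written to that node only.
 */
static void toy_set_pgd(int nid, unsigned int idx, uint64_t val)
{
	if (idx == TOY_KTEXT_PGD_IDX) {
		toy_pgd[nid][idx] = val;
		return;
	}

	for (int n = 0; n < NR_TOY_NODES; n++)
		toy_pgd[n][idx] = val;
}
```

Because the check happens only at level 0, nothing below that level needs to know whether replication is in use.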
> > The level-0 page table entry for the kernel can then be used to point
> > at a node-unique set of level 1..N page tables to map the appropriate
> > copy of the kernel text (and read-only data) into kernel space, while
> > keeping the kernel read-write data shared between nodes.
> > 
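The split between replicated and shared regions might look something like the following toy model, in which each node has its own lower-level table: the text and read-only data entries point at node-local copies, while the read-write data entry points at a single shared copy. The region names and helpers are invented for the example, and a real page table holds descriptors rather than C pointers.

```c
#include <assert.h>
#include <stdint.h>

#define NR_TOY_NODES 2	/* hypothetical node count */

/* Regions of the toy kernel image, one lower-level entry each. */
enum toy_region { TOY_TEXT, TOY_RODATA, TOY_DATA, TOY_NR_REGIONS };

static uint32_t toy_text[NR_TOY_NODES][4];	/* replicated per node */
static uint32_t toy_rodata[NR_TOY_NODES][4];	/* replicated per node */
static uint32_t toy_data[4];			/* single shared copy */

/* Node-unique lower-level table hanging off the kernel's level-0 entry. */
static void *toy_node_table[NR_TOY_NODES][TOY_NR_REGIONS];

/* Point each node's entries at its local text/rodata and the shared data. */
static void toy_build_tables(void)
{
	for (int nid = 0; nid < NR_TOY_NODES; nid++) {
		toy_node_table[nid][TOY_TEXT] = toy_text[nid];
		toy_node_table[nid][TOY_RODATA] = toy_rodata[nid];
		toy_node_table[nid][TOY_DATA] = toy_data;
	}
}

/* A write through any node's mapping of the data lands in the one copy. */
static void toy_write_data(int nid, unsigned int idx, uint32_t val)
{
	uint32_t *data = toy_node_table[nid][TOY_DATA];

	data[idx] = val;
}
```

A store made through node 0's table is visible through node 1's table, since both data entries refer to the same memory, whereas the text entries deliberately differ.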
> > Performance Analysis
> > --------------------
> > 
> > Needless to say, the performance results from kernel text replication
> > are workload specific, but appear to show a gain of between 6% and
> > 17% for database-centric workloads. When combined with userspace
> > awareness of NUMA, this can result in a gain of over 50%.
> > 
> > Problems
> > --------
> > 
> > There are a few areas that are a problem for kernel text replication:
> > 1) As this series changes the kernel space virtual address space
> >    layout, it breaks KASAN - and I've zero knowledge of KASAN so I
> >    have no idea how to fix it. I would be grateful for input from
> >    KASAN folk on how to fix this.
> > 
> > 2) KASLR cannot be used with kernel text replication, since we need
> >    to place the kernel in its own L0 page table entry, not in vmalloc
> >    space. KASLR is disabled when support for kernel text replication
> >    is enabled.
> > 
> > 3) Changing the kernel virtual address space layout also means that
> >    kaslr_offset() and kaslr_enabled() need to become macros rather
> >    than inline functions due to the use of PGDIR_SIZE in the
> >    calculation of KIMAGE_VADDR. Since asm/pgtable.h defines this
> >    constant, but asm/memory.h is included by asm/pgtable.h, having
> >    this symbol available would produce a circular include
> >    dependency, so I don't think there is any choice here.
> > 
> > 4) read-only protection for replicated kernel images is not yet
> >    implemented.
> > 
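On problem 3, the reason a macro works where an inline function does not can be shown in a single file. A macro body is expanded only at the point of use, so it may name a constant that a later header defines. An inline function body is parsed where it appears, so the constant would have to be visible already. The `TOY_*` names and values below are placeholders chosen for the example and are not the definitions used by the series.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Stand-in for the earlier header (the role asm/memory.h plays).
 * TOY_PGDIR_SIZE is not defined yet, which is fine for a macro. A static
 * inline function using it here would fail to compile.
 */
#define TOY_KIMAGE_VADDR	(TOY_VA_START + TOY_PGDIR_SIZE)
#define toy_kaslr_offset(base)	((uint64_t)(base) - TOY_KIMAGE_VADDR)
#define toy_kaslr_enabled(base)	(toy_kaslr_offset(base) != 0)

/* Stand-in for the later header (the role asm/pgtable.h plays). */
#define TOY_VA_START	UINT64_C(0xffff000000000000)
#define TOY_PGDIR_SIZE	(UINT64_C(1) << 39)

/* By the time a function actually uses the macros, everything is defined. */
static uint64_t toy_image_offset(uint64_t runtime_base)
{
	return toy_kaslr_offset(runtime_base);
}
```

The cost is the usual one for function-like macros: no type checking of the argument, hence the explicit cast in `toy_kaslr_offset()`.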
> > Patch overview:
> > 
> > Patch 1 cleans up the rox page protection logic.
> > Patch 2 reorganises the kernel virtual address space layout (causing
> >   problems 1 and 3).
> > Patch 3 provides a version of cpu_replace_ttbr1 that takes physical
> >   addresses.
> > Patch 4 makes a needed cache flushing function visible.
> > Patches 5 through 16 are the guts of kernel text replication.
> > Patch 17 adds the Kconfig entry for it.
> > 
> > Further patches not included in this set add a Kconfig for the default
> > state, a test module, and add code to verify the replicated kernel
> > text matches the node 0 text after the kernel has completed most of
> > its boot.
> > 
> >  Documentation/admin-guide/kernel-parameters.txt |   5 +
> >  arch/arm64/Kconfig                              |  10 +-
> >  arch/arm64/include/asm/cacheflush.h             |   2 +
> >  arch/arm64/include/asm/ktext.h                  |  45 ++++++
> >  arch/arm64/include/asm/memory.h                 |  26 ++--
> >  arch/arm64/include/asm/mmu_context.h            |  12 +-
> >  arch/arm64/include/asm/pgtable.h                |  35 ++++-
> >  arch/arm64/include/asm/smp.h                    |   1 +
> >  arch/arm64/kernel/alternative.c                 |   4 +-
> >  arch/arm64/kernel/asm-offsets.c                 |   1 +
> >  arch/arm64/kernel/cpufeature.c                  |   2 +-
> >  arch/arm64/kernel/head.S                        |   3 +-
> >  arch/arm64/kernel/hibernate.c                   |   2 +-
> >  arch/arm64/kernel/patching.c                    |   7 +-
> >  arch/arm64/kernel/smp.c                         |   3 +
> >  arch/arm64/kernel/suspend.c                     |   3 +-
> >  arch/arm64/kernel/vmlinux.lds.S                 |   3 +
> >  arch/arm64/mm/Makefile                          |   2 +
> >  arch/arm64/mm/init.c                            |   3 +
> >  arch/arm64/mm/ktext.c                           | 198 ++++++++++++++++++++++++
> >  arch/arm64/mm/mmu.c                             |  85 ++++++++--
> >  21 files changed, 413 insertions(+), 39 deletions(-)
> >  create mode 100644 arch/arm64/include/asm/ktext.h
> >  create mode 100644 arch/arm64/mm/ktext.c
> > 
> > 
> > -- 
> > RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> > FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
> > 
> 