[RFC PATCH v2 00/13] nommu UML

Fri Nov 15 06:48:03 PST 2024

Hello Johannes,

# added Geert, Greg, Rich to Cc (sorry for the noise)
# here is the original email of this thread: just in case.
# https://lore.kernel.org/linux-um/cover.1731290567.git.thehajime@gmail.com/

On Fri, 15 Nov 2024 19:12:39 +0900,
Johannes Berg wrote:
> 
> On Mon, 2024-11-11 at 15:27 +0900, Hajime Tazaki wrote:
> > This is a series of patches of nommu arch addition to UML.  It would
> > be nice to ask comments/opinions on this.
> 
> So I've been thinking about this for a while now...

Thank you for your time!

> To be clear, I'm not really _against_ it. With around 1200 lines of
> code, it really isn't even big. But I also don't know how brittle it is?
> Testing it is made somewhat difficult with the map-at-zero requirement
> too.

Since CI/testing facilities these days run on VMs, setting
/proc/sys/vm/mmap_min_addr to 0 in order to test this feature is not
so difficult.

> And really I keep coming back to asking myself what the use case is?
> 
> Is it to test something for no-MMU platforms more easily? But I'm not
> sure what that would be? Have any no-MMU platform maintainers weighed in
> on this, have they even _seen_ it? Is that interesting? Is it more
> interesting than testing an emulated system with the right architecture?

Let me explain one recent experience for the use case.

During the development of this patch series, I spotted (and fixed; the
fix is now in Linus' tree) an issue in the vma subsystem, which uses
the maple-tree library.

There is a (slightly) long thread discussing it with the maple-tree
maintainer, Liam (below).

- traversing vma on nommu
https://lists.infradead.org/pipermail/maple-tree/2024-November/003740.html

I bisected the issue and can reproduce it after v6.12-rc1, but it
never happened with the other nommu archs (we tested with m68k and
riscv, both on buildroot qemu).  Maybe because I'm more familiar with
nommu UML than with m68k/riscv qemu, I could comfortably
reproduce/debug/test what was going on with gdb, and finally proposed
a fix (a one-liner patch).

- the patch (I hope it will land in the 6.12 release)
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=247d720b2c5d22f7281437fd6054a138256986ba

This is only one example of its usefulness.  I believe you can imagine
that this can also happen with regular (MMU) UML.

I also privately run a CI test which verifies that my patches don't
break MMU UML, with a simple boot test (static/dynamic), 12 kunit
tests in the kernel tree, basic benchmarks with lmbench, etc.  This is
not specific to nommu UML, though.

https://github.com/thehajime/linux/actions/runs/11811327291
# The above URL may expire in the future.

> With it this way you'd probably have to build the right libraries and
> binaries for x86-64 no-MMU, does such a thing already exist somewhere?

I'm preparing patches for upstream Alpine Linux so that such binaries
become available in an appropriate way.  Note that I didn't modify the
code of the programs themselves (except for a clear bug); I just build
them with the NOMMU option, which is already implemented in
busybox/musl-libc.

https://gitlab.alpinelinux.org/thehajime/aports/-/merge_requests/2/diffs

I have not contacted the upstream developers yet, so this diff might
change.

> It also doesn't look like it's meant to replace LKL? But even LKL I
> don't really know - are people using it, and if so what for? Seems
> lklfuse is a thing for some BSD folks?
> 
> Is there something else to use it for?

This patchset is independent of, and not related to, LKL.
# you may be confused because I've still been working on LKL.

(off topic)
lklfuse is indeed used by FreeBSD but is not well maintained (afaik).
NixOS (a Linux distribution) also uses lklfuse iirc.

> If it's the first (test no-MMU) then it probably should be smarter about
> not really relying on retpoline.

# I assume s/retpoline/zpoline/ in the rest of your message.

> Why is the focus so much on that
> anyway? If testing no-MMU was the most important thing then probably
> you'd have started with seccomp, and actually execute the syscalls from
> that, to not have all those restrictions that come from rewriting
> binaries, rather than ignoring the whole thing.

For the JIT part (and also syscalls from dlopen-ed binaries), as I
mentioned in the other reply, support can be implemented but is not
there yet.

The choice of zpoline is based on the speed of syscall invocations.
Our investigation showed that seccomp (and similar mechanisms such as
SUD: syscall user dispatch, ptrace, int3 signaling) are still slower
than binary rewrites, because of the signal delivery inherent in those
mechanisms.  LD_PRELOAD with symbol rewrites is faster (even than
binary rewrites) but fundamentally cannot hook all syscalls.

zpoline tries to fill this gap, and we thought this fits the UML
usage.
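
To make the idea concrete, here is a deliberately naive sketch of the
rewrite step: it replaces the two-byte syscall (0f 05) and sysenter
(0f 34) instructions with the equally sized `callq *%rax` (ff d0).
Since %rax holds the (small) syscall number at that point, the call
lands in a nop sled placed at address 0, which leads to the hook; that
is why mapping at zero is required.  The function name
rewrite_syscalls() is made up for this mail and this is not the code
in the patch series.  A plain byte scan like this can also hit
immediates or data, so a real implementation has to find instruction
boundaries properly.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Naive illustration only: replace each `syscall` (0f 05) or
 * `sysenter` (0f 34) byte pair in buf with `callq *%rax` (ff d0).
 * Returns the number of replaced instructions.
 */
static size_t rewrite_syscalls(uint8_t *buf, size_t len)
{
	size_t i = 0, n = 0;

	while (i + 1 < len) {
		if (buf[i] == 0x0f &&
		    (buf[i + 1] == 0x05 || buf[i + 1] == 0x34)) {
			buf[i] = 0xff;
			buf[i + 1] = 0xd0;
			n++;
			i += 2;	/* skip the instruction just rewritten */
		} else {
			i++;
		}
	}
	return n;
}
```

For example, feeding it the bytes of `mov $39, %eax; syscall; ret`
(b8 27 00 00 00 0f 05 c3) rewrites only the two syscall bytes and
leaves the rest untouched.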

> Though of course you did
> add a filter now, but I think it'll just crash?

this part (just crash w/ SIGSYS) can be improved.

> So I could perhaps see this use case, but then I'd probably think it
> should be more generic (i.e. able to execute all no-MMU binaries
> including ones that may be using JIT compilation etc.) and not _require_
> retpoline, but rather use it as an optimisation where that's possible
> (i.e. if you can map at zero)?

I understand your point.

> If the use case instead of more LKL-type usage, I guess I don't really
> understand it, though to be honest I also don't really fully understand
> LKL itself, but it always _seemed_ very different.

I didn't explain the comparison between LKL and nommu UML, as I
thought they are independent of each other.

> Somewhat hyperbolically, I'm wondering if it's just a tech demo for
> retpoline?

An additional reason we used zpoline to replace the syscall
instruction is the following:

our first implementation of this nommu UML used a modified version of
the (userspace) standard library (musl-libc), without zpoline.  We
reimplemented the syscall wrappers to call a syscall entry point
(__kernel_vsyscall) exposed via the ELF aux vector.

Like this:

static __inline long __syscall0(long n)
{
	unsigned long ret = -1;
	__asm__ __volatile__ ("call *%1" : "=a"(ret)
			: "r"(__sysinfo), "a"(n)
			: "rcx", "r11", "memory");
	return ret;
}
# __sysinfo is the address exposed via the aux vector.
# this was actually not my work, but Ricardo's (in Cc).

https://github.com/nabla-containers/musl-libc/blob/e11be13e6abc06f7034d6b98552b5928d0ed0dfe/arch/x86_64/syscall_arch.h#L13-L20

With that, we can use unmodified binaries, but we need to modify
libc.so and ld.so, which I thought isn't trivial.

My motivation to apply zpoline here is to eliminate this dependency;
with zpoline, we don't have to modify the standard library (musl).

In addition to that, since a NOMMU kernel shares the address space
among multiple userspace processes, we only have to prepare the
trampoline code a single time, while in the multiple address space
model (the MMU case) the zpoline-related code needs to be installed on
each process invocation.  This is not a direct motivation to use
zpoline here, but a side benefit in the given environment.

> So I dunno. Reading through it again there are a few minor things wrt.
> code style and debug things left over, but it's not awful ;-)

Oh really?  I'll double check them, but it would be nice to know any
flaws you found.

> I'd also
> prefer the code to be more clearly "marked" (as nommu), perhaps putting
> new files into a nommu/ directory, or something like that. But that's
> pretty minor.

I understand.  I'm afraid that there will still be a number of ifdefs,
since nommu UML relies on various parts of the existing UML
infrastructure.

> Still it's in a lot of places and chances are it'll make bigger
> refactoring (like seccomp mode) harder. Perhaps if at all it should come
> after seccomp mode and use that to execute syscalls if zpoline can't be
> done, and to catch all the cases where zpoline doesn't work (you have
> that in the docs)?

fallback mechanism after zpoline failure might be interesting.

> What do others think? Would you use it? What for?

-- Hajime