[PATCH v3 00/17] kexec: Allow preservation of ftrace buffers

Fri Feb 2 04:58:52 PST 2024

Hi Philipp,

On 29.01.24 17:34, Philipp Rudo wrote:
> Hi Alex,
>
> adding linux-integrity as there are some synergies with IMA_KEXEC (in case we
> get KHO to work).
>
> First of all I believe that having a generic framework to pass information from
> one kernel to the other across kexec would be a good thing. But I'm afraid that

Thanks, I'm happy to hear that you agree with the basic motivation :).
There are fundamentally two problems with passing data:

   * Passing structured data in a cross-architecture way
   * Passing memory

KHO tackles both. It proposes a common FDT-based format that allows us
to pass per-subsystem properties. That way, a subsystem does not need to
know whether it's running on ARM, x86, RISC-V or s390x. It just becomes
KHO-aware and can pass data.

On top of that, it proposes a standardized "mem" property (and some 
magic around that) which allows subsystems to pass memory.
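
To make the "mem" idea a bit more concrete, here is a minimal sketch in
plain userspace C of how a receiving subsystem could walk such a
property. It assumes the property is a packed array of (address, length)
pairs of 64-bit values in native byte order. The names kho_mem_range,
kho_mem_count() and kho_mem_get() are made up for this illustration and
are not necessarily what the patches define.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout of one entry in a "mem" property. */
struct kho_mem_range {
	uint64_t addr; /* physical start address */
	uint64_t len;  /* length in bytes */
};

/*
 * Return the number of entries in a raw property of prop_len bytes,
 * or -1 if the length is not a whole multiple of the entry size.
 */
static int kho_mem_count(size_t prop_len)
{
	if (prop_len % sizeof(struct kho_mem_range) != 0)
		return -1;
	return (int)(prop_len / sizeof(struct kho_mem_range));
}

/*
 * Copy entry idx out of the raw property. memcpy() avoids relying on
 * the property buffer being 8-byte aligned.
 * Returns 0 on success, -1 if the property is malformed or idx is out
 * of range.
 */
static int kho_mem_get(const void *prop, size_t prop_len, int idx,
		       struct kho_mem_range *out)
{
	int count = kho_mem_count(prop_len);

	if (count < 0 || idx < 0 || idx >= count)
		return -1;
	memcpy(out, (const char *)prop + (size_t)idx * sizeof(*out),
	       sizeof(*out));
	return 0;
}
```

A subsystem would call kho_mem_get() once per entry and reserve each
returned range before the allocator can hand it out.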

> you are ignoring some fundamental problems which makes it extremely hard, if
> not impossible, to reliably transfer the kernel's state from one kernel to the
> other.
>
> One thing I don't understand is how reusing the scratch area is working. Sure
> you pass its location via the dt/boot_params but I don't see any code that
> makes it a CMA region. So IIUC the scratch area won't be available for the 2nd
> kernel. Which is probably for the better as IIUC the 2nd kernel gets loaded and
> runs inside that area and I don't believe the CMA design ever considered that
> the kernel image could be included in a CMA area.

That one took me a long time to figure out sensibly (with recursion all
the way down) while building KHO :). I hope I detailed it sensibly in the
documentation - please let me know how to improve it in case it's
unclear: https://lore.kernel.org/lkml/20240117144704.602-8-graf@amazon.com/

Let me also explain inline, in different words, what happens:

The first (and only the first) kernel that boots allocates a CMA region 
as "scratch region". It loads the new kernel into that region. It passes 
that region as "scratch region" to the next kernel. The next kernel now 
takes it and marks every page block that the scratch region spans as CMA:

https://lore.kernel.org/lkml/20240117144704.602-3-graf@amazon.com/

The CMA hint doesn't mean we create an actual CMA region. It mostly 
means that the kernel won't use this memory for any kernel allocations. 
Kernel allocations up to this point are allocations we don't need to 
pass on with KHO again. Kernel allocations past that point may be 
allocations that we want to pass, so we just never place them into the 
"scratch region" again.

And because we now already have a scratch region from the previous 
kernel, we keep reusing that forever with any new KHO kexec.
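
As a rough illustration of "every page block that the scratch region
spans", the sketch below computes the first pageblock index and the
number of pageblocks touched by a region. It is standalone C, not code
from the series: the 2 MiB pageblock size, struct pb_span and
scratch_pageblocks() are assumptions for the example, and the real
pageblock size depends on architecture and config.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed pageblock size for illustration only: 2 MiB. */
#define PAGEBLOCK_BYTES (2ULL << 20)

struct pb_span {
	uint64_t first; /* index of the first pageblock touched */
	uint64_t count; /* number of pageblocks touched */
};

/*
 * Compute which pageblocks the region [start, start + size) spans.
 * A partially covered pageblock counts as touched, so a region that is
 * not pageblock aligned marks more memory than its own size.
 */
static struct pb_span scratch_pageblocks(uint64_t start, uint64_t size)
{
	struct pb_span span = { 0, 0 };
	uint64_t last;

	if (size == 0)
		return span;

	span.first = start / PAGEBLOCK_BYTES;
	last = (start + size - 1) / PAGEBLOCK_BYTES;
	span.count = last - span.first + 1;
	return span;
}
```

For example, a 2 MiB region starting at 3 MiB touches two pageblocks
(indices 1 and 2), so both would get the CMA hint.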

> Staying at reusing the scratch area. One thing that is broken for sure is that
> you reuse the scratch area without ever checking the kho_scratch parameter of
> the 2nd kernel's command line. Remember, with kexec you are dealing with two
> different kernels with two different command lines. Meaning you can only reuse
> the scratch area if the requested size in the 2nd kernel is identical to the
> one of the 1st kernel. In all other cases you need to adjust the scratch area's
> size or reserve a new one.

Hm. So you're saying a user may want to change the size of the scratch
area with a KHO kexec. That's insanely risky because you (as you
rightfully point out below) may have significant fragmentation at that
point. And we will only know once we're in the new kernel, when it's too
late to abort. IMHO it's better to just declare the scratch region as
immutable during KHO to avoid that pitfall.
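
In code, the "immutable scratch region" policy could look roughly like
the sketch below: an inherited region always wins over whatever size the
new kernel's command line asks for. struct scratch_region and
scratch_size_to_use() are hypothetical names for this illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a scratch region. */
struct scratch_region {
	uint64_t start;
	uint64_t size;
};

/*
 * Decide which scratch size the booting kernel should use.
 *
 * inherited:    region handed over by the previous kernel, or NULL on a
 *               cold boot without KHO.
 * cmdline_size: size requested on this kernel's command line.
 *
 * With an inherited region the command line request is ignored, because
 * resizing would need a fresh large allocation in already fragmented
 * memory.
 */
static uint64_t scratch_size_to_use(const struct scratch_region *inherited,
				    uint64_t cmdline_size)
{
	if (inherited != NULL && inherited->size != 0)
		return inherited->size;
	return cmdline_size;
}
```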

> This directly leads to the next problem. In kho_reserve_previous_mem you are
> reusing the different memory regions wherever the 1st kernel allocated them.
> But that also means you are handing over the 1st kernel's memory
> fragmentation to the 2nd kernel and you do that extremely early during boot.
> Which means that users who need to allocate large continuous physical memory,
> like the scratch area or the crashkernel memory, will have increasing chance to
> not find a suitable area. Which IMHO is unacceptable.

Correct :). It basically means that any large allocation from the 1st
kernel that you want to preserve should be passed on to the next kernel.
So if the 1st kernel allocated a large crash area, it's safest to pass
that allocation using KHO to ensure the next kernel also has the region
fully reserved. Otherwise the next kernel may accidentally place data
into the previously reserved crash region (which would be contiguously
free at early init of the 2nd kernel) and fragment it again.

> Finally, and that's the big elephant in the room, is your lax handling of the
> unstable kernel internal ABI. Remember, you are dealing with two different
> kernels, that also means two different source levels and two different configs.
> So only because both the 1st and 2nd kernel have e.g. a struct buffer_page
> doesn't mean that they have the same struct buffer_page. But that's what your
> code implicitly assumes. For KHO ever to make it upstream you need to make sure
> that both kernels are "speaking the same language".

Wow, I hope it didn't come across as that! The whole point of using FDT
and compatible strings in KHO is to solve exactly that problem. Any time
a passed-over data structure changes incompatibly, you need to modify
the compatible string of the subsystem that owns the now incompatible
data.

So in the example of struct buffer_page, it means that if anyone changes 
the few bits we care about in struct buffer_page, we need to ensure that 
the new kernel emits "ftrace,cpu-v2" compatible strings. We can at that 
point choose whether we want to implement compat handling for 
"ftrace,cpu-v1" style struct buffer_pages or only support same version 
ingestion.
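
A minimal sketch of what that choice looks like on the receiving side,
in standalone C: the new kernel inspects the compatible string and
either ingests the data, runs a compat path, or refuses. The enum, the
function name and the accept_v1 switch are assumptions for this example;
only the "ftrace,cpu-v1" and "ftrace,cpu-v2" strings come from the text
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum ftrace_kho_format {
	FTRACE_KHO_UNSUPPORTED = 0, /* do not touch the handed-over data */
	FTRACE_KHO_V1,              /* old layout, needs compat handling */
	FTRACE_KHO_V2,              /* layout this kernel uses natively */
};

/*
 * Map a handed-over compatible string to a format this (hypothetical
 * v2) kernel can ingest. accept_v1 models the choice of whether compat
 * handling for the old layout was implemented.
 */
static enum ftrace_kho_format ftrace_kho_match(const char *compatible,
					       bool accept_v1)
{
	if (compatible == NULL)
		return FTRACE_KHO_UNSUPPORTED;
	if (strcmp(compatible, "ftrace,cpu-v2") == 0)
		return FTRACE_KHO_V2;
	if (accept_v1 && strcmp(compatible, "ftrace,cpu-v1") == 0)
		return FTRACE_KHO_V1;
	return FTRACE_KHO_UNSUPPORTED;
}
```

Anything that does not match is skipped rather than interpreted, so an
unknown layout results in lost trace data instead of garbage reads.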

The one thing that we could improve on here today IMHO is to have
compile-time errors if any part of struct buffer_page changes
semantically: we'd create a few defines for the offsets of the fields we
want in "ftrace,cpu-v1" as well as the size of struct buffer_page, and
then compare them to the actual struct offsets at compile time to
ensure they stay identical.
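
Something along these lines, sketched in standalone C11. The struct
below is a simplified stand-in and NOT the kernel's real struct
buffer_page; the FTRACE_CPU_V1_* defines and field names are likewise
hypothetical. In the kernel itself one would use static_assert() or
BUILD_BUG_ON() in the same spirit.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for illustration, not the real kernel struct. */
struct buffer_page_example {
	unsigned long write; /* write position within the page */
	unsigned long read;  /* read position within the page */
	void *page;          /* pointer to the data page */
};

/* Layout that a hypothetical "ftrace,cpu-v1" format promises. */
#define FTRACE_CPU_V1_OFF_WRITE 0
#define FTRACE_CPU_V1_OFF_READ  sizeof(unsigned long)
#define FTRACE_CPU_V1_OFF_PAGE  (2 * sizeof(unsigned long))
#define FTRACE_CPU_V1_SIZE      (2 * sizeof(unsigned long) + sizeof(void *))

/* The build breaks as soon as the struct drifts from the v1 layout. */
static_assert(offsetof(struct buffer_page_example, write) ==
	      FTRACE_CPU_V1_OFF_WRITE, "v1 layout: write moved");
static_assert(offsetof(struct buffer_page_example, read) ==
	      FTRACE_CPU_V1_OFF_READ, "v1 layout: read moved");
static_assert(offsetof(struct buffer_page_example, page) ==
	      FTRACE_CPU_V1_OFF_PAGE, "v1 layout: page moved");
static_assert(sizeof(struct buffer_page_example) == FTRACE_CPU_V1_SIZE,
	      "v1 layout: struct size changed");
```

Whoever then changes the struct gets a build failure and has to decide
whether to restore the layout or bump the compatible string.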

Please let me know how I can clarify that more in the documentation. It 
really is the absolute core of KHO.

> Personally I see two possible solutions:
>
> 1) You introduce a stable intermediate format for every subsystem similar to
> what IMA_KEXEC does. This should work for simple types like struct buffer_page
> but for complex ones like struct vfio_device that's basically impossible.

I don't see why. The only reason KHO passes struct buffer_page as memory 
is because we want to be able to produce traces even after KHO 
serialization is done. For vfio_device, I think it's perfectly 
reasonable to serialize any data we need to preserve directly into FDT 
properties.

> 2) You also hand over the ABI version for every given type (basically just a
> hash over all fields including all the dependencies). So the 2nd kernel can
> verify that the data handed over is in a format it can handle and if not bail
> out with a descriptive error message rather than reading garbage. Plus side is
> that once such a system is in place you can reuse it to automatically resolve
> all dependencies so you no longer need to manually store the buffer_page and
> its buffer_data_page separately.
> Down side is that traversing the debuginfo (including the ones from modules) is
> not a simple task and I expect that such a system will be way more complex than
> the rest of KHO. In addition there are some cases that the versioning won't be
> able to capture. For example if a type contains a "void *"-field. Then although
> the definition of the type is identical in both kernels the field can be cast
> to different types when used. Another problem will be function pointers which
> you first need to resolve in the 1st kernel and then map to the identical
> function in the 2nd kernel. This will become particularly "fun" when the
> function is part of a module that isn't loaded at the time when you try to
> recreate the kernel's state.

The whole point of KHO is to leave it to the subsystem which path it
wants to take. The subsystem can pass binary data and validate it via
FDT properties (like compatible strings). That data can be identical to
today's in-kernel data structures (usually a bad idea) or can be a new
intermediate data format. But the subsystem can also choose to fully
serialize into FDT properties and not pass any memory at all for state
that would be in structs. Or something in between.

> So to summarize, while it would be nice to have a generic framework like KHO to
> pass data from one kernel to the other via kexec there are good reasons why it
> doesn't exist, yet.

I hope my explanations above clarify things a bit. Let me know if you're 
at FOSDEM, happy to talk about the internals there as well :)

Alex
