[PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers
Pasha Tatashin
pasha.tatashin at soleen.com
Mon Feb 10 12:58:00 PST 2025
On Mon, Feb 10, 2025 at 3:22 PM Jason Gunthorpe <jgg at nvidia.com> wrote:
>
> On Thu, Feb 06, 2025 at 03:27:45PM +0200, Mike Rapoport wrote:
> > diff --git a/Documentation/ABI/testing/sysfs-kernel-kho b/Documentation/ABI/testing/sysfs-kernel-kho
> > new file mode 100644
> > index 000000000000..f13b252bc303
> > --- /dev/null
> > +++ b/Documentation/ABI/testing/sysfs-kernel-kho
> > @@ -0,0 +1,53 @@
> > +What: /sys/kernel/kho/active
> > +Date: December 2023
> > +Contact: Alexander Graf <graf at amazon.com>
> > +Description:
> > + Kexec HandOver (KHO) allows Linux to transition the state of
> > + compatible drivers into the next kexec'ed kernel. To do so,
> > + device drivers will serialize their current state into a DT.
> > + While the state is serialized, they are unable to perform
> > + any modifications to state that was serialized, such as
> > + handed over memory allocations.
> > +
> > + When this file contains "1", the system is in the transition
> > + state. When it contains "0", it is not. To switch between the
> > + two states, echo the respective number into this file.
>
> I don't think this is a great interface for the actual state machine..
In our next proposal we are going to remove this "activate" phase.
>
> > +What: /sys/kernel/kho/dt_max
> > +Date: December 2023
> > +Contact: Alexander Graf <graf at amazon.com>
> > +Description:
> > + KHO needs to allocate a buffer for the DT that gets
> > + generated before it knows the final size. By default, it
> > + will allocate 10 MiB for it. You can write to this file
> > + to modify the size of that allocation.
>
> Seems gross, why can't it use a non-contiguous page list to generate
> the FDT? :\
We will consider some of these ideas in a future version. I like the
idea of using preserved memory to carry a sparse KHO tree, i.e. an FDT
spread over sparse memory. Maybe the anchor page could describe how it
should be vmapped into a virtually contiguous tree in the next kernel?
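
To make the anchor-page idea concrete, here is a rough userspace sketch,
not kernel code. Every name in it (sketch_anchor, sketch_scatter,
sketch_gather, sketch_release) is made up for illustration, the anchor
layout is an assumption, and a plain copy stands in for what vmap() would
do in the next kernel. The tree bytes are treated as an opaque blob.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_MAX_PAGES 16u

/* Hypothetical anchor: lists, in order, the pages that hold the tree. */
struct sketch_anchor {
	uint32_t nr_pages;
	uint32_t total_len;             /* bytes of tree data */
	void *pages[SKETCH_MAX_PAGES];  /* a real design would record physical addresses */
};

/* Free every page recorded in the anchor. */
static void sketch_release(struct sketch_anchor *a)
{
	uint32_t i;

	for (i = 0; i < a->nr_pages; i++)
		free(a->pages[i]);
	a->nr_pages = 0;
	a->total_len = 0;
}

/* Scatter a blob across separately allocated pages and record them. */
static int sketch_scatter(struct sketch_anchor *a, const void *blob, uint32_t len)
{
	uint32_t off = 0;

	memset(a, 0, sizeof(*a));
	if (len > SKETCH_PAGE_SIZE * SKETCH_MAX_PAGES)
		return -1;

	while (off < len) {
		uint32_t n = len - off < SKETCH_PAGE_SIZE ? len - off : SKETCH_PAGE_SIZE;
		void *page = calloc(1, SKETCH_PAGE_SIZE);

		if (!page) {
			sketch_release(a);
			return -1;
		}
		memcpy(page, (const char *)blob + off, n);
		a->pages[a->nr_pages++] = page;
		off += n;
	}
	a->total_len = len;
	return 0;
}

/* Stand-in for vmap(): build one contiguous view from the listed pages. */
static void *sketch_gather(const struct sketch_anchor *a)
{
	char *out = malloc(a->total_len ? a->total_len : 1);
	uint32_t off = 0, i;

	if (!out)
		return NULL;
	for (i = 0; i < a->nr_pages; i++) {
		uint32_t left = a->total_len - off;
		uint32_t n = left < SKETCH_PAGE_SIZE ? left : SKETCH_PAGE_SIZE;

		memcpy(out + off, a->pages[i], n);
		off += n;
	}
	assert(off == a->total_len);
	return out;
}
```

The point of the sketch is only that the producer never needs one large
contiguous buffer: it needs the pages plus one small anchor, and the
consumer decides how to present them as a single tree.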
>
> See below for a suggestion..
>
> > +static int kho_serialize(void)
> > +{
> > + void *fdt = NULL;
> > + int err = -ENOMEM;
> > +
> > + fdt = kvmalloc(kho_out.dt_max, GFP_KERNEL);
> > + if (!fdt)
> > + goto out;
> > +
> > + if (fdt_create(fdt, kho_out.dt_max)) {
> > + err = -EINVAL;
> > + goto out;
> > + }
> > +
> > + err = fdt_finish_reservemap(fdt);
> > + if (err)
> > + goto out;
> > +
> > + err = fdt_begin_node(fdt, "");
> > + if (err)
> > + goto out;
> > +
> > + err = fdt_property_string(fdt, "compatible", "kho-v1");
> > + if (err)
> > + goto out;
> > +
> > + /* Loop through all kho dump functions */
> > + err = blocking_notifier_call_chain(&kho_out.chain_head, KEXEC_KHO_DUMP, fdt);
> > + err = notifier_to_errno(err);
>
> I don't see this really working long term. I think we'd like each
> component to be able to serialize at its own pace under userspace
> control.
>
> This design requires that the whole thing be wrapped in a notifier
> callback just so we can make use of the fdt APIs.
>
> It seems like a poor fit to me.
>
> IMHO if you want to keep using FDT I suggest that each serializing
> component (ie driver, ftrace whatever) allocate its own FDT fragment
> from scratch and the main KHO one just link to the memory that holds
> those fragments.
>
> Ie the driver experience would be more like
>
> kho = kho_start_storage("my_compatible_string,v1", some_kind_of_instance_key);
>
> fdt...(kho->fdt..)
>
> kho_finish_storage(kho);
>
> Where this ends up creating a stand alone FDT fragment:
>
> /dts-v1/;
> / {
> compatible = "linux-kho,my_compatible_string,v1";
> instance = some_kind_of_instance_key;
> key-value-1 = <..>;
> key-value-2 = <..>;
> };
>
> And then kho_finish_storage() would remember the phys/length until the
> kexec fdt is produced as the very last step.
>
> This way we could do things like fdbox an iommufd and create the above
> FDT fragment completely separately from any notifier chain and,
> crucially, disconnected from the fdt_create() for the kexec payload.
>
> Further, if you split things like this (it will waste some small
> amount of memory) you can probably get to a point where no single FDT
> is more than 4k. That looks like it would simplify/robustify a lot of
> stuff?
>
> Jason
>
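
For reference, a minimal userspace model of the start/finish flow
described above. This is a sketch under assumptions, not a proposed API:
frag_start(), frag_add() and frag_finish() are hypothetical stand-ins for
the kho_start_storage()/kho_finish_storage() pair in the quoted pseudocode,
a "key=value" text buffer stands in for a real FDT fragment, and a static
array stands in for whatever the final kexec FDT would link to.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define FRAG_SIZE 4096u  /* one fragment never exceeds a 4k page */
#define MAX_FRAGS 8u

/* Handle a component holds while it is serializing. */
struct frag {
	char *buf;       /* stands in for a per-component FDT fragment */
	size_t used;
};

/* What the core remembers once a component is done: address and length. */
struct frag_ref {
	const void *addr;
	size_t len;
};

static struct frag_ref registry[MAX_FRAGS];
static unsigned int nr_frags;

/* Append one property; fails if the fragment would outgrow its page. */
static int frag_add(struct frag *f, const char *key, const char *val)
{
	size_t room = FRAG_SIZE - f->used;
	int n = snprintf(f->buf + f->used, room, "%s=%s\n", key, val);

	if (n < 0 || (size_t)n >= room)
		return -1;
	f->used += (size_t)n;
	return 0;
}

/* Allocate a standalone fragment and stamp it with its identity. */
static struct frag *frag_start(const char *compatible, const char *instance)
{
	struct frag *f = calloc(1, sizeof(*f));

	if (!f)
		return NULL;
	f->buf = calloc(1, FRAG_SIZE);
	if (!f->buf || frag_add(f, "compatible", compatible) ||
	    frag_add(f, "instance", instance)) {
		free(f->buf);
		free(f);
		return NULL;
	}
	return f;
}

/* Record address/length for the final step; no notifier chain involved. */
static int frag_finish(struct frag *f)
{
	if (nr_frags == MAX_FRAGS)
		return -1;  /* caller still owns f */
	registry[nr_frags].addr = f->buf;
	registry[nr_frags].len = f->used;
	nr_frags++;
	free(f);  /* only the handle goes away; the buffer stays alive */
	return 0;
}
```

A component would call frag_start(), add its properties at its own pace,
and call frag_finish(); producing the top-level tree then reduces to
walking the registry as the very last step.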