[PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec

Mon May 5 07:35:57 PDT 2025

On 4/29/25 1:01 AM, Chris Li wrote:
> Hi Andrey,
> 
> I am working on the PCI portion of the live update and looking at
> using KSTATE as an alternative to the FDT. Here are some high level
> feedbacks.
> 

Hi, thanks  a lot.

> On Mon, Mar 10, 2025 at 5:04 AM Andrey Ryabinin <arbn at yandex-team.com> wrote:
>>
>>  Main changes from v1 [1]:
>>   - Get rid of abusing crashkernel and implent proper way to pass memory to new kernel
>>   - Lots of misc cleanups/refactorings.
>>
>> kstate (kernel state) is a mechanism to describe internal some part of the
>> kernel state, save it into the memory and restore the state after kexec
>> in the new kernel.
>>
>> The end goal here and the main use case for this is to be able to
>> update host kernel under VMs with VFIO pass-through devices running
>> on that host. Since we are pretty far from that end goal yet, this
>> only establishes some basic infrastructure to describe and migrate complex
>> in-kernel states.
>>
>> The idea behind KSTATE resembles QEMU's migration framework [1], which
>> solves quite similar problem - migrate state of VM/emulated devices
>> across different versions of QEMU.
>>
>> This is an altenative to Kexec Hand Over (KHO [3]).
>>
>> So, why not KHO?
>>
> 
> KHO does more than just serializing/unserializing. It also has scratch
> areas etc to allow safely performing early allocation without stepping
> on the preserved memory. I see KSTATE as an alternative to libFDT as
> ways of serializing the preserved memory. Not a replacement for KHO.
> 
> With that, it would be great to see a KSTATE build on top of the
> current version of KHO. The V6 version of KHO uses a recursive FDT
> object. I see recursive FDT can map to the C struct description
> similar to the KSTATE field description nicely. However, that will
> require KSTATE to make some considerable changes to embrace the KHO
> v6. For example, the KSTATE uses one contiguous stream buffer and KHO
> V6 uses many recursive physical address object pointers for different
> objects.  Maybe a KSTATE V3?
> 

Yep, I'll take a look into combinig KSTATE with KHO.

....

> 
>>      a_kho_restore()
>>      {
>>              ...
>>              a.i = fdt_getprop(fdt, offset, "i", &len);
>>              if (!a.i || len != sizeof(a.i))
>>                 goto err
>>              *a.p_ulong = fdt_getprop....
>>      }
>>
>>     Each driver/subsystem has to solve this problem in their own way.
>>     Also if we use fdt properties for individual fields, that might be wastefull
>>     in terms of used memory, as these properties use strings as keys.
> 
> Right, I need to write a lot of boilerplate code to do the per
> property save/restore. I am not worried too much about memory usage. A
> lot of string keys are not much longer than 8 bytes. The memory saving
> convert to binary index is not huge. I actually would suggest adding
> the string version of the field name to the description table, so that
> we can dump the state in KSTATE just like the YAML FDT output for
> debugging purposes. It is a very useful feature of FDT to dump the
> current saving state into a human readable form. KSTATE can have the
> same feature added.
> 

kstate_field already have string with name of the field:

#define KSTATE_BASE_TYPE(_f, _state, _type) {		\
	.name = (__stringify(_f)),			\

Currently it's not used in code, but it's there for debug purposes 

>>    While with KSTATE solves the same problem in more elegant way, with this:
>>         struct kstate_description a_state = {
>>                 .name = "a_struct",
>>                 .version_id = 1,
>>                 .id = KSTATE_TEST_ID,
>>                 .state_list = LIST_HEAD_INIT(test_state.state_list),
>>                 .fields = (const struct kstate_field[]) {
>>                         KSTATE_BASE_TYPE(i, struct a, int),
>>                         KSTATE_BASE_TYPE(s, struct a, char [10]),
>>                         KSTATE_POINTER(p_ulong, struct a),
>>                         KSTATE_PAGE(page, struct a),
>>                         KSTATE_END_OF_LIST()
>>                 },
>>         };
>>
>>
>>         {
>>                 static unsigned long ulong
>>                 static struct a a_data = { .p_ulong = &ulong };
>>
>>                 kstate_register(&test_state, &a_data);
>>         }
>>
>>        The driver needs only to have a proper 'kstate_description' and call kstate_register()
>>        to save/restore a_data.
>>        Basically 'struct kstate_description' provides instructions how to save/restore 'struct a'.
>>        And kstate_register() does all this save/restore stuff under the hood.
> 
> It seems the KSTATE uses one contiguous stream and the object has to
> be loaded in the order it was saved. For the PCI code, the PCI device
> scanning and probing might cause the device load out of the order of
> saving. (The PCI probing is actually the reverse order of saving).
> This kstate_register() might pose restrictions on the restore order.
> PCI will need to look up and find the device state based on the PCI
> device ID. Other subsystems will likely have the requirement to look
> up their own saved state as well.
> I also see KSTATE can be extended to support that.
> 

Absolutely agreed. I think we need to decouple restore and register, ie remove
 restore_misgrate_state() from kstate_register(). Add instance_id argument to kstate_register(),
so the PCI code could do:
            kstate_register(&pci_state, pdev, PCI_DEVID(pdev->bus->number, pdev->devfn));

And on probing stage (probably in pci_device_add()) call
        kstate_restore(&pci_state, dev, PCI_DEVID(bus->number, dev->devfn))

which would locate state for the device if any and restore it.

>>  - Another bonus point - kstate can preserve migratable memory, which is required
>>     to preserve guest memory
>>
>>
>> So now to the part how this works.
>>
>> State of kernel data (usually it's some struct) is described by the
>> 'struct kstate_description' containing the array of individual
>> fields descpriptions - 'struct kstate_field'. Each field
>> has set of bits in ->flags which instructs how to save/restore
>> a certain field of the struct. E.g.:
>>   - KS_BASE_TYPE flag tells that field can be just copied by value,
>>
>>   - KS_POINTER means that the struct member is a pointer to the actual
>>      data, so it needs to be dereference before saving/restoring data
>>      to/from kstate data steam.
>>
>>   - KS_STRUCT - contains another struct,  field->ksd must point to
>>       another 'struct kstate_dscription'
> 
> The field can't have both bits set for KS_BASE_TYPE and KS_STRUCT
> type, right? Some of these flag combinations do not make sense. This
> part might need more careful planning to keep it simple. Maybe some of
> the flags bits should be enum.
> 

Yes, this needs more thought. Mutually exclusive flags could be moved in separate enum field.
 Some may be not needed at all. e.g. instead of KS_STRUCT we could just check if (field->ksd != NULL)

>>
>>   - KS_CUSTOM - Some non-trivial field that requires custom kstate_field->save()
>>                ->restore() callbacks to save/restore data.
>>
>>   - KS_ARRAY_OF_POINTER - array of pointers, the size of array determined by the
>>                          field->count() callback
>>   - KS_ADDRESS - field is a pointer to either vmemmap area (struct page) or
>>      linear address. Store offset
> 
> I think we want to describe different stream types.
> For example the most simple stream container is just a contiguous
> buffer with start address and size.
> The more complex one might be and size then an array of page pointers,
> all those pointers add to up the new buffer which describe an saved
> KSTATE that is larger than 4K and spread into an array of pages. Those
> pages don't need to be contiguous. Such a page array buffer stores the
> KSTATE entry described by a separate description table.
> 

Agreed, I had similar thoughts. But this complicates code, so I started with
something simple.

>>
>>   - KS_END - special flag indicating the end of migration stream data.
>>
>> kstate_register() call accepts kstate_description along with an instance
>> of an object and registers it in the global 'states' list.
>>
>> During kexec reboot phase we go through the list of 'kstate_description's
>> and each instance of kstate_description forms the 'struct kstate_entry'
>> which save into the kstate's data stream.
>>
>> The 'kstate_entry' contains information like ID of kstate_description, version
>> of it, size of migration data and the data itself. The ->data is formed in
>> accordance to the kstate_field's of the corresponding kstate_description.
> 
> The version for the kstate_description might not be enough. The
> version works if there is a linear history. Here we are likely to have
> different vendors add their own extension to the device state saving.

I think vendors can just declare separate kstate_description with different ID.
The only problem with this, is that kstate_description.id is integer, so it would be
a problem to allocate those without conflicts.
Perhaps change the id to string? So vendors can just add vendor prefix to ID.

> I suggest instead we save the old kernel's kstate_description table
> (once per description table as a recursive object)  alongside the
> object physical address as well. The new kernel has their new version
> of the description table. It can compare between the old and new
> description tables and find out what fields need to be upgraded or
> downgraded. The new kernel will use the old kstate_description table
> to decode the previous kernel's saved object.

Hmm.. I'm not sure, there is a lot to think about.
 This might make changes in kstate_description painful,
e.g. if I want to rearrange some ->flags for whatever reason.
So how to deal with changes in kstate_description itself?

How do we save links to methods in kstate_field (->restore/->save/->count),
and what if we'll need to change function prototypes of these methods ?

> I think that way it is> more flexible to support adding one or more features and not tight to
> which version has what feature. It can also make sure the new kernel
> can always dump the old KSTATE into YAML.
> 
> That way we might be able to simplify the subsection and the
> depreciation flags. The new kernel doesn't need to carry the history
> of changes made to the old description table.
> 
>> After the reboot, when the kstate_register() called it parses migration
>> stream, finds the appropriate 'kstate_entry' and restores the contents of
>> the object in accordance with kstate_description and ->fields.
> 
> Again this restoring can happen in a different order when the PCI
> device scanning and probing order. The restoration might not happen in
> one single call chain. Material for V3?

Agreed.

> I am happy to work with you to get KSTATE working with the existing KHO effort.
> 
Thanks for useful feedback, appreciated.