[PATCH v7 2/3] kho: fix deferred init of kho scratch

Wed Mar 18 10:08:26 PDT 2026

On 18 Mar 2026, at 11:45, Michał Cłapiński wrote:

> On Wed, Mar 18, 2026 at 4:26 PM Zi Yan <ziy at nvidia.com> wrote:
>>
>> On 18 Mar 2026, at 11:18, Michał Cłapiński wrote:
>>
>>> On Wed, Mar 18, 2026 at 4:10 PM Zi Yan <ziy at nvidia.com> wrote:
>>>>
>>>> On 17 Mar 2026, at 10:15, Michal Clapinski wrote:
>>>>
>>>>> Currently, if DEFERRED is enabled, kho_release_scratch will initialize
>>>>> the struct pages and set migratetype of kho scratch. Unless the whole
>>>>> scratch fit below first_deferred_pfn, some of that will be overwritten
>>>>> either by deferred_init_pages or memmap_init_reserved_pages.
>>>>>
>>>>> To fix it, I modified kho_release_scratch to only set the migratetype
>>>>> on already initialized pages. Then, modified init_pageblock_migratetype
>>>>> to set the migratetype to CMA if the page is located inside scratch.
>>>>>
>>>>> Signed-off-by: Michal Clapinski <mclapinski at google.com>
>>>>> ---
>>>>>  include/linux/memblock.h           |  2 --
>>>>>  kernel/liveupdate/kexec_handover.c | 10 ++++++----
>>>>>  mm/memblock.c                      | 22 ----------------------
>>>>>  mm/page_alloc.c                    |  7 +++++++
>>>>>  4 files changed, 13 insertions(+), 28 deletions(-)
>>>>>
>>>>
>>>> <snip>
>>>>
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index ee81f5c67c18..5ca078dde61d 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -55,6 +55,7 @@
>>>>>  #include <linux/cacheinfo.h>
>>>>>  #include <linux/pgalloc_tag.h>
>>>>>  #include <linux/mmzone_lock.h>
>>>>> +#include <linux/kexec_handover.h>
>>>>>  #include <asm/div64.h>
>>>>>  #include "internal.h"
>>>>>  #include "shuffle.h"
>>>>> @@ -549,6 +550,12 @@ void __meminit init_pageblock_migratetype(struct page *page,
>>>>>                    migratetype < MIGRATE_PCPTYPES))
>>>>>               migratetype = MIGRATE_UNMOVABLE;
>>>>>
>>>>> +     /*
>>>>> +      * Mark KHO scratch as CMA so no unmovable allocations are made there.
>>>>> +      */
>>>>> +     if (unlikely(kho_scratch_overlap(page_to_phys(page), PAGE_SIZE)))
>>>>> +             migratetype = MIGRATE_CMA;
>>>>> +
>>>>
>>>> If this is only for deferred init code, why not put it in deferred_free_pages()?
>>>> Otherwise, all init_pageblock_migratetype() callers need to pay the penalty
>>>> of traversing kho_scratch array.
>>>
>>> Because reserve_bootmem_region() doesn't call deferred_free_pages().
>>> So I would also have to modify it.
>>>
>>> And the early initialization won't pay the penalty of traversing the
>>> kho_scratch array, since then kho_scratch is NULL.
>>
>> How about hugetlb_bootmem_init_migratetype(), init_cma_pageblock(),
>> init_cma_reserved_pageblock(), __init_page_from_nid(), memmap_init_range(),
>> __init_zone_device_page()?
>>
>> 1. are they having any PFN range overlapping with kho?
>> 2. is kho_scratch NULL for them?
>>
>> 1 tells us whether putting code in init_pageblock_migratetype() could save
>> the hassle of changing all above locations.
>> 2 tells us how many callers are affected by traversing kho_scratch.
>
> I could try answering those questions but
>
> 1. I'm new to this and I'm not sure how correct the answers will be.
>
> 2. If you're not using CONFIG_KEXEC_HANDOVER, the performance penalty
> will be zero.
> If you are using it, currently you have to disable
> CONFIG_DEFERRED_STRUCT_PAGE_INIT and the performance hit from this is
> far, far greater. This solution saves 0.5s on my setup (100GB of
> memory). We can always improve the performance further in the future.
>

OK, I asked Claude for help and the answer is that not all callers of
init_pageblock_migratetype() touch kho scratch memory regions. Basically,
you only need to perform the kho_scratch_overlap() check in
__init_page_from_nid() to achieve the same end result.

Claude's analysis is below. Based on my understanding:
1. memmap_init_range() runs before kho_memory_init(), so it does not need
the check.

2. __init_zone_device_page() is not relevant.

3. init_cma_reserved_pageblock() / init_cma_pageblock() already pass
MIGRATE_CMA.

4. kho scratch is not used for hugetlb pages, so that path does not need
the check either.

5. kho_release_scratch() already passes MIGRATE_CMA itself.

The remaining memblock_free_pages() needs a check, but I am not 100% sure.

# kho_scratch_overlap() in init_pageblock_migratetype() — scope analysis

## Context

Commit a7700b3c6779 ("kho: fix deferred init of kho scratch") added a
kho_scratch_overlap() call inside init_pageblock_migratetype() in
mm/page_alloc.c:

```c
if (unlikely(kho_scratch_overlap(page_to_phys(page), PAGE_SIZE)))
    migratetype = MIGRATE_CMA;
```

kho_scratch_overlap() does a NULL check followed by a loop over the
kho_scratch array. For non-KHO boots (kho_scratch == NULL) the cost is
a single NULL load and branch. For KHO boots the loop runs on every call
to init_pageblock_migratetype().
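
To make the cost described above concrete, here is a minimal userspace
sketch of a NULL check followed by a linear scan. It is not the kernel's
implementation: `struct scratch_region`, `scratch_overlap()` and the
explicit table/count parameters are hypothetical names chosen for
illustration, and the half-open interval test is an assumption about how
such a check would typically be written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one scratch region; not the kernel's struct. */
struct scratch_region {
	uint64_t addr;
	uint64_t size;
};

/*
 * Userspace model of the check: return true if [phys, phys + size)
 * intersects any region in the table. A NULL table models a non-KHO boot.
 */
static bool scratch_overlap(const struct scratch_region *tbl, size_t cnt,
			    uint64_t phys, uint64_t size)
{
	size_t i;

	assert(size > 0);

	if (!tbl)	/* non-KHO boot: one load and one branch */
		return false;

	for (i = 0; i < cnt; i++) {
		uint64_t start = tbl[i].addr;
		uint64_t end = start + tbl[i].size;

		/* Half-open ranges overlap iff each starts before the other ends. */
		if (phys < end && start < phys + size)
			return true;
	}

	return false;
}
```

In this model the per-call cost on a KHO boot is proportional to the
number of table entries, which is the overhead the question below is
about.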

## Question

Does this add overhead for callers whose memory range cannot overlap
with scratch? Can the check be moved to the caller side?

## Call site analysis

init_pageblock_migratetype() has nine call sites. The init call ordering
relevant to scratch is:

```
setup_arch()
  zone_sizes_init() -> free_area_init() -> memmap_init_range()   [1]

mm_init_free_all() / start_kernel():
  kho_memory_init() -> kho_release_scratch()                     [2]
  memblock_free_all()
    free_low_memory_core_early()
      memmap_init_reserved_pages()
        reserve_bootmem_region() -> __init_deferred_page()
          -> __init_page_from_nid()                              [3]
  deferred init kthreads -> __init_page_from_nid()               [4]
```

### Per call site

**mm/mm_init.c — __init_page_from_nid() (deferred init)**

Called for every deferred pfn (>= first_deferred_pfn). Scratch pages
in the deferred range are not touched by kho_release_scratch() (new
code clips end_pfn to first_deferred_pfn) and not touched by
memmap_init_range() (stops at first_deferred_pfn). This path sets
MIGRATE_MOVABLE on deferred scratch pageblocks after
kho_release_scratch() has already run, so nothing later corrects them
to MIGRATE_CMA.

**Needs the fix: yes.**

Two sub-paths reach this function for deferred scratch pages:
- deferred init kthreads [4]
- reserve_bootmem_region() -> __init_deferred_page() [3]
  (early_page_initialised() returns early for non-deferred pfns, so
  __init_page_from_nid() is only reached for deferred pfns here too)
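
The ordering argument above can be replayed in a small userspace model.
This is only a sketch under simplifying assumptions: memory is counted in
whole pageblocks, and every name here (`enum model_mt`, `struct
boot_model`, `replay_boot()`) is hypothetical and does not exist in the
kernel. It shows that whichever path writes a pageblock last decides its
migratetype.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types; none of these names exist in the kernel. */
enum model_mt { MT_UNSET, MT_MOVABLE, MT_CMA };

#define NR_BLOCKS 8

struct boot_model {
	unsigned int first_deferred;	/* first pageblock left to deferred init */
	unsigned int scratch_start;	/* scratch is [scratch_start, scratch_end) */
	unsigned int scratch_end;
};

static bool in_scratch(const struct boot_model *m, unsigned int b)
{
	return b >= m->scratch_start && b < m->scratch_end;
}

/*
 * Replay the three writers in boot order. With override == false the
 * deferred path always writes MOVABLE; with override == true it picks
 * CMA for scratch blocks, modelling a caller-side scratch check.
 */
static void replay_boot(const struct boot_model *m, bool override,
			enum model_mt blocks[NR_BLOCKS])
{
	unsigned int b;

	assert(m->first_deferred <= NR_BLOCKS);

	for (b = 0; b < NR_BLOCKS; b++)
		blocks[b] = MT_UNSET;

	/* [1] early init stops at first_deferred */
	for (b = 0; b < m->first_deferred; b++)
		blocks[b] = MT_MOVABLE;

	/* [2] scratch release, clipped to already initialized blocks */
	for (b = m->scratch_start;
	     b < m->scratch_end && b < m->first_deferred; b++)
		blocks[b] = MT_CMA;

	/* [3]/[4] deferred init writes last for every deferred block */
	for (b = m->first_deferred; b < NR_BLOCKS; b++)
		blocks[b] = (override && in_scratch(m, b)) ? MT_CMA : MT_MOVABLE;
}
```

With `first_deferred = 4` and scratch covering blocks 2 to 5, blocks 4
and 5 end up as `MT_MOVABLE` when `override` is false and as `MT_CMA`
when it is true; blocks 2 and 3 are `MT_CMA` in both cases.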

**mm/mm_init.c — memmap_init_range()**

Runs during setup_arch() [1], before kho_memory_init() [2]. Sets
MIGRATE_MOVABLE on scratch pageblocks, but kho_release_scratch() runs
afterward and correctly overrides to MIGRATE_CMA for non-deferred
scratch. For deferred scratch, memmap_init_range() stops at
first_deferred_pfn and never processes them.

**Needs the fix: no.**

**mm/mm_init.c — __init_zone_device_page()**

ZONE_DEVICE path only. Scratch is normal RAM, not ZONE_DEVICE.

**Needs the fix: no.**

**mm/mm_init.c — memblock_free_pages() (lines ~2012 and ~2023)**

Called by memblock_free_all() for free (non-reserved) memblock regions.
Scratch is memblock-reserved and released through the CMA path, not
through memblock_free_all().

**Needs the fix: no.**

**mm/mm_init.c — init_cma_reserved_pageblock() / init_cma_pageblock()**

Both already pass MIGRATE_CMA. The kho_scratch_overlap() check would
be redundant even if scratch reaches these paths.

**Needs the fix: no (redundant).**

**mm/hugetlb.c — __prep_compound_gigantic_folio()**

Gigantic hugepage setup. Scratch regions are not used for gigantic
hugepages.

**Needs the fix: no.**

**kernel/liveupdate/kexec_handover.c — kho_release_scratch()**

Already passes MIGRATE_CMA. Additionally, kho_scratch is NULL at the
point kho_release_scratch() runs (kho_memory_init() sets kho_scratch
only after kho_release_scratch() returns), so kho_scratch_overlap()
would return false regardless.

**Needs the fix: no.**

## Conclusion

The only path that actually requires the MIGRATE_CMA override is
__init_page_from_nid(). All problematic sub-paths (deferred init
kthreads and reserve_bootmem_region()) converge there.

The check could be moved to __init_page_from_nid() to keep the
KHO-specific concern out of the generic init_pageblock_migratetype():

```c
/* mm/mm_init.c: __init_page_from_nid() */
if (pageblock_aligned(pfn)) {
    enum migratetype mt = MIGRATE_MOVABLE;

    /* Keep deferred KHO scratch pageblocks as CMA instead of MOVABLE. */
    if (kho_scratch_overlap(PFN_PHYS(pfn), PAGE_SIZE))
        mt = MIGRATE_CMA;
    init_pageblock_migratetype(pfn_to_page(pfn), mt, false);
}
```

__init_page_from_nid() is only compiled under CONFIG_DEFERRED_STRUCT_PAGE_INIT,
which is the only configuration where the bug can occur, so the
kho_scratch_overlap() call would be naturally gated by that config.

Best Regards,
Yan, Zi