[PATCH] kexec: Enable CMA based contiguous allocation
Alexander Graf
graf at amazon.com
Tue May 13 01:20:15 PDT 2025
Hi Andrew,
On 13.05.25 01:59, Andrew Morton wrote:
> On Mon, 12 May 2025 22:57:52 +0000 Alexander Graf <graf at amazon.com> wrote:
>
>> When booting a new kernel with kexec_file, the kernel picks a target
>> location that the kernel should live at, then allocates random pages,
>> checks whether any of those pages happens to coincide with
>> a target address range and if so, uses them for that range.
>>
>> For every page allocated this way, it then creates a page list that the
>> relocation code - code that executes while all CPUs are off and we are
>> just about to jump into the new kernel - copies to their final memory
>> location. We cannot put them there beforehand, because chances are pretty
>> good that at least some page in the target range is already in use by
>> the currently running Linux environment.
>>
>> All of this is inefficient.
>>
>> Since kexec got introduced, Linux has gained the CMA framework which
>> can provide physically contiguous allocations, while keeping that
>> memory available for movable allocations when it is not needed for
>> contiguous ones. The default CMA area serves DMA allocations.
>>
>> This patch adds logic to the kexec file loader to attempt to place the
>> target payload at a location allocated from CMA. If successful, it uses
>> that memory range directly instead of creating copy instructions during
>> the hot phase. To ensure that there is a safety net in case anything goes
>> wrong with the CMA allocation, it also adds a flag for user space to force
>> disable CMA allocations.
>>
>> Using CMA allocations has two advantages:
>>
>> 1) Faster. There is no more need to copy in the hot phase.
> How much faster? Kinda matters as "fast" is the whole point of the patch!
It mostly depends on your memory bandwidth, cache size and the amount of
payload you kexec into. Today, we copy kernel, initrd and dtb during the
kexec hot phase. With this patch, we copy none.
At a hypothetical payload size of 100MiB, on an AMLogic S905 we're
looking at roughly 50ms; on an Ice Lake system, about 4ms.
As for real life benchmarks, on an Apple M1 VM with 53MiB payload I
measure a 2.2ms speedup.
That said, to me robustness is at least as important as speed here. The
current scheme creates memory allocation conflicts almost by design,
which increase the odds of something going wrong.
>> 2) More robust. Even if some page is accidentally still in use for DMA,
>> the new kernel image is safe from that access, because it resides in
>> a memory region that the old kernel considers allocated, and the new
>> kernel gets a chance to reinitialize the offending device.
> Is this known to be a problem in current code?
Yes, see David's email. I'll clarify in the patch description.
>
> Some minor observations:
>
>> --- a/include/linux/kexec.h
>> +++ b/include/linux/kexec.h
>>
>> ...
>>
>> @@ -331,6 +336,7 @@ struct kimage {
>> */
>> unsigned int hotplug_support:1;
>> #endif
>> + unsigned int no_cma : 1;
> "no_cma:1" is more conventional.
Fixed.
>
>> #ifdef ARCH_HAS_KIMAGE_ARCH
>> struct kimage_arch arch;
>> diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
>> index 5ae1741ea8ea..8958ebfcff94 100644
>> --- a/include/uapi/linux/kexec.h
>> +++ b/include/uapi/linux/kexec.h
>>
>> ...
>>
>> +static int kimage_load_cma_segment(struct kimage *image, struct kexec_segment *segment)
>> +{
>> + unsigned long maddr;
>> + size_t ubytes, mbytes;
>> + int result = 0;
>> + unsigned char __user *buf = NULL;
>> + unsigned char *kbuf = NULL;
>> + char *ptr = page_address(segment->cma);
>> +
>> + if (image->file_mode)
>> + kbuf = segment->kbuf;
>> + else
>> + buf = segment->buf;
>> + ubytes = segment->bufsz;
>> + mbytes = segment->memsz;
>> + maddr = segment->mem;
>> +
>> + /* Initialize the buffer with zeros to allow for smaller input buffers */
>> + memset(ptr, 0, mbytes);
> Would it be more efficient to zero the remainder after performing the copy?
Yup. Changed :).
>
>> + /* Then copy from source buffer to the CMA one */
>> + while (mbytes) {
>> + size_t uchunk, mchunk;
>> +
>> + ptr += maddr & ~PAGE_MASK;
>> + mchunk = min_t(size_t, mbytes,
>> + PAGE_SIZE - (maddr & ~PAGE_MASK));
>> + uchunk = min(ubytes, mchunk);
>> +
>> + if (uchunk) {
>> + /* For file based kexec, source pages are in kernel memory */
>> + if (image->file_mode)
>> + memcpy(ptr, kbuf, uchunk);
>> + else
>> + result = copy_from_user(ptr, buf, uchunk);
>> + ubytes -= uchunk;
>> + if (image->file_mode)
>> + kbuf += uchunk;
>> + else
>> + buf += uchunk;
>> + }
>> +
>> + if (result) {
>> + result = -EFAULT;
>> + goto out;
>> + }
>> +
>> + ptr += mchunk;
>> + maddr += mchunk;
>> + mbytes -= mchunk;
>> +
>> + cond_resched();
>> + }
>> +out:
>> + return result;
>> +}
>> +
>>
>> ...
>>
>> --- a/kernel/kexec_file.c
>> +++ b/kernel/kexec_file.c
>>
>> ...
>>
>> @@ -632,6 +635,38 @@ static int kexec_walk_resources(struct kexec_buf *kbuf,
>> return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
>> }
>>
>> +static int kexec_alloc_contig(struct kexec_buf *kbuf)
>> +{
>> + u32 pages = (u32)(kbuf->memsz >> PAGE_SHIFT);
> I don't think the cast is needed?
I added it to make the cast explicit (since we potentially lose a few
bits there). With size_t, that is no longer necessary.
>
>> + unsigned long mem;
>> + struct page *p;
>> +
>> + if (kbuf->image->no_cma)
>> + return -EPERM;
>> +
>> + p = dma_alloc_from_contiguous(NULL, pages, get_order(kbuf->buf_align), true);
> dma_alloc_from_contiguous()'s `count' arg is size_t. Making `pages'
> size_t seems best here. (And nr_pages would be a better identifier!)
>
>
>> + if (!p)
>> + return -EADDRNOTAVAIL;
> EADDRNOTAVAIL is a networking thing. People will be surprised to see
> kexec returning networking errors. Perhaps choose something more
> appropriate?
I was surprised too when I saw it in the surrounding code. It's the
error code that locate_mem_hole() returns, so I copied it. Let me change
it at least in this function to something sensible.
Alex
Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597