[PATCH] kexec: Enable CMA based contiguous allocation

Andrew Morton akpm at linux-foundation.org
Mon May 12 16:59:33 PDT 2025


On Mon, 12 May 2025 22:57:52 +0000 Alexander Graf <graf at amazon.com> wrote:

> When booting a new kernel with kexec_file, the kernel picks a target
> location that the new kernel should live at, then allocates random
> pages and checks whether any of those pages happens to coincide with
> the target address range; if so, it uses them for that range.
> 
> For every page allocated this way, it then creates a page list entry
> that the relocation code - code that executes while all CPUs are off
> and we are just about to jump into the new kernel - uses to copy the
> page to its final memory location. We cannot put the pages there
> earlier, because chances are pretty good that at least some page in
> the target range is still in use by the currently running Linux
> environment.
> 
> All of this is inefficient.
> 
> Since kexec was introduced, Linux has gained the CMA framework, which
> can perform physically contiguous memory allocations while keeping
> that memory available for movable allocations whenever it is not
> needed for contiguous ones. The default CMA area is used for DMA
> allocations.
> 
> This patch adds logic to the kexec file loader to attempt to place the
> target payload at a location allocated from CMA. If successful, it uses
> that memory range directly instead of creating copy instructions for
> the hot phase. To provide a safety net in case anything goes wrong with
> the CMA allocation, it also adds a flag that lets user space
> force-disable CMA allocations.
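
For reference, an untested sketch of how user space might opt out.
The uapi hunk is elided below, so both the flag name and its value
here are assumptions on my part:

	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* hypothetical name/value - check the actual uapi header */
	#define KEXEC_FILE_NO_CMA	0x00000008

	static long kexec_load_no_cma(int kernel_fd, int initrd_fd,
				      const char *cmdline)
	{
		/* cmdline_len must include the terminating NUL */
		return syscall(SYS_kexec_file_load, kernel_fd, initrd_fd,
			       strlen(cmdline) + 1, cmdline,
			       KEXEC_FILE_NO_CMA);
	}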
> 
> Using CMA allocations has two advantages:
> 
>   1) Faster. There is no longer any need to copy in the hot phase.

How much faster?  Kinda matters, as "fast" is the whole point of the patch!

>   2) More robust. Even if by accident some page is still in use for DMA,
>      the new kernel image will be safe from that access, because it
>      resides in a memory region that the old kernel considers allocated,
>      and the new kernel gets a chance to reinitialize the offending
>      device.

Is this known to be a problem in current code?

Some minor observations:

> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
>
> ...
>
> @@ -331,6 +336,7 @@ struct kimage {
>  	 */
>  	unsigned int hotplug_support:1;
>  #endif
> +	unsigned int no_cma : 1;

"no_cma:1" is more conventional.

>  #ifdef ARCH_HAS_KIMAGE_ARCH
>  	struct kimage_arch arch;
> diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h
> index 5ae1741ea8ea..8958ebfcff94 100644
> --- a/include/uapi/linux/kexec.h
> +++ b/include/uapi/linux/kexec.h
>
> ...
>
> +static int kimage_load_cma_segment(struct kimage *image, struct kexec_segment *segment)
> +{
> +	unsigned long maddr;
> +	size_t ubytes, mbytes;
> +	int result = 0;
> +	unsigned char __user *buf = NULL;
> +	unsigned char *kbuf = NULL;
> +	char *ptr = page_address(segment->cma);
> +
> +	if (image->file_mode)
> +		kbuf = segment->kbuf;
> +	else
> +		buf = segment->buf;
> +	ubytes = segment->bufsz;
> +	mbytes = segment->memsz;
> +	maddr = segment->mem;
> +
> +	/* Initialize the buffer with zeros to allow for smaller input buffers */
> +	memset(ptr, 0, mbytes);

Would it be more efficient to zero the remainder after performing the copy?
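
Something like this, perhaps (untested sketch; it assumes ubytes <=
mbytes, which this function already relies on, and that segment->mem
is page-aligned, as a CMA allocation should be):

	/* run the copy loop first, then zero only the tail that the
	 * source buffer did not cover */
	if (segment->memsz > segment->bufsz)
		memset(page_address(segment->cma) + segment->bufsz, 0,
		       segment->memsz - segment->bufsz);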

> +	/* Then copy from source buffer to the CMA one */
> +	while (mbytes) {
> +		size_t uchunk, mchunk;
> +
> +		ptr += maddr & ~PAGE_MASK;
> +		mchunk = min_t(size_t, mbytes,
> +				PAGE_SIZE - (maddr & ~PAGE_MASK));
> +		uchunk = min(ubytes, mchunk);
> +
> +		if (uchunk) {
> +			/* For file based kexec, source pages are in kernel memory */
> +			if (image->file_mode)
> +				memcpy(ptr, kbuf, uchunk);
> +			else
> +				result = copy_from_user(ptr, buf, uchunk);
> +			ubytes -= uchunk;
> +			if (image->file_mode)
> +				kbuf += uchunk;
> +			else
> +				buf += uchunk;
> +		}
> +
> +		if (result) {
> +			result = -EFAULT;
> +			goto out;
> +		}
> +
> +		ptr    += mchunk;
> +		maddr  += mchunk;
> +		mbytes -= mchunk;
> +
> +		cond_resched();
> +	}
> +out:
> +	return result;
> +}
> +
>
> ...
>
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
>
> ...
>
> @@ -632,6 +635,38 @@ static int kexec_walk_resources(struct kexec_buf *kbuf,
>  		return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
>  }
>  
> +static int kexec_alloc_contig(struct kexec_buf *kbuf)
> +{
> +	u32 pages = (u32)(kbuf->memsz >> PAGE_SHIFT);

I don't think the cast is needed?

> +	unsigned long mem;
> +	struct page *p;
> +
> +	if (kbuf->image->no_cma)
> +		return -EPERM;
> +
> +	p = dma_alloc_from_contiguous(NULL, pages, get_order(kbuf->buf_align), true);

dma_alloc_from_contiguous()'s `count' arg is size_t.  Making `pages'
size_t seems best here.  (And nr_pages would be a better identifier!)
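
I.e., something like:

	size_t nr_pages = kbuf->memsz >> PAGE_SHIFT;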


> +	if (!p)
> +		return -EADDRNOTAVAIL;

EADDRNOTAVAIL is a networking thing.  People will be surprised to see
kexec returning networking errors.  Perhaps choose something more
appropriate?
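
-ENOMEM would be the obvious candidate for a failed allocation:

	if (!p)
		return -ENOMEM;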

> +	pr_debug("allocated %d DMA pages at 0x%lx", pages, page_to_boot_pfn(p));
> +
> +	mem = page_to_boot_pfn(p) << PAGE_SHIFT;
> +
> +	if (kimage_is_destination_range(kbuf->image, mem, mem + kbuf->memsz)) {
> +		/* Our region is already in use by a statically defined one. Bail out. */
> +		pr_debug("CMA overlaps existing mem: 0x%lx+0x%lx\n", mem, kbuf->memsz);
> +		dma_release_from_contiguous(NULL, p, pages);
> +		return -EADDRNOTAVAIL;
> +	}
> +
> +	kbuf->mem = page_to_boot_pfn(p) << PAGE_SHIFT;
> +	kbuf->cma = p;
> +
> +	arch_kexec_post_alloc_pages(page_address(p), pages, 0);
> +
> +	return 0;
> +}
> +
>
> ...
>
