[PATCH 11/13] arm64/kexec: Add core kexec support
Geoff Levand
geoff at infradead.org
Wed Sep 24 17:25:04 PDT 2014
Hi Mark,
On Thu, 2014-09-18 at 02:13 +0100, Mark Rutland wrote:
> On Tue, Sep 09, 2014 at 11:49:05PM +0100, Geoff Levand wrote:
> > +++ b/arch/arm64/include/asm/kexec.h
> > @@ -0,0 +1,52 @@
> > +/*
> > + * kexec for arm64
> > + *
> > + * Copyright (C) Linaro.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#if !defined(_ARM64_KEXEC_H)
> > +#define _ARM64_KEXEC_H
> > +
> > +/* Maximum physical address we can use pages from */
> > +
> > +#define KEXEC_SOURCE_MEMORY_LIMIT (-1UL)
> > +
> > +/* Maximum address we can reach in physical address mode */
> > +
> > +#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
> > +
> > +/* Maximum address we can use for the control code buffer */
> > +
> > +#define KEXEC_CONTROL_MEMORY_LIMIT (-1UL)
> > +
>
> What are these used for? I see that other architectures seem to do the
> same thing, but they look odd.
They need to be defined for the core kexec code, but arm64
doesn't use them.
> > +#define KEXEC_CONTROL_PAGE_SIZE 4096
>
> What's this used for?
This is the size reserved for the reboot_code_buffer defined in
kexec's core code. For arm64, we copy our relocate_new_kernel
routine into the reboot_code_buffer.
> Does this work with 64k pages, and is there any reason we can't figure
> out the actual size of the code (so we don't get bitten if it grows)?
Kexec will reserve pages to satisfy KEXEC_CONTROL_PAGE_SIZE, so for
all arm64 page configs one page will be reserved for this value (4096).
I have a check if relocate_new_kernel is too big
'.org KEXEC_CONTROL_PAGE_SIZE' in the latest implementation.
> > +
> > +#define KEXEC_ARCH KEXEC_ARCH_ARM64
> > +
> > +#define ARCH_HAS_KIMAGE_ARCH
> > +
> > +#if !defined(__ASSEMBLY__)
> > +
> > +struct kimage_arch {
> > + void *ctx;
> > +};
> > +
> > +/**
> > + * crash_setup_regs() - save registers for the panic kernel
> > + *
> > + * @newregs: registers are saved here
> > + * @oldregs: registers to be saved (may be %NULL)
> > + */
> > +
> > +static inline void crash_setup_regs(struct pt_regs *newregs,
> > + struct pt_regs *oldregs)
> > +{
> > +}
>
> It would be nice to know what we're going to do for this.
>
> Is this a required function, or can we get away without crash kernel
> support for the moment?
This is just to avoid a build error. It is not used for kexec re-boot.
> > +
> > +#endif /* !defined(__ASSEMBLY__) */
> > +
> > +#endif
> > diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> > index df7ef87..8b7c029 100644
> > --- a/arch/arm64/kernel/Makefile
> > +++ b/arch/arm64/kernel/Makefile
> > @@ -29,6 +29,8 @@ arm64-obj-$(CONFIG_ARM64_CPU_SUSPEND) += sleep.o suspend.o
> > arm64-obj-$(CONFIG_JUMP_LABEL) += jump_label.o
> > arm64-obj-$(CONFIG_KGDB) += kgdb.o
> > arm64-obj-$(CONFIG_EFI) += efi.o efi-stub.o efi-entry.o
> > +arm64-obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o \
> > + cpu-properties.o
> >
> > obj-y += $(arm64-obj-y) vdso/
> > obj-m += $(arm64-obj-m)
> > diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> > new file mode 100644
> > index 0000000..043a3bc
> > --- /dev/null
> > +++ b/arch/arm64/kernel/machine_kexec.c
> > @@ -0,0 +1,612 @@
> > +/*
> > + * kexec for arm64
> > + *
> > + * Copyright (C) Linaro.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/kexec.h>
> > +#include <linux/of_fdt.h>
> > +#include <linux/slab.h>
> > +#include <linux/uaccess.h>
> > +
> > +#include <asm/cacheflush.h>
> > +#include <asm/cpu_ops.h>
> > +#include <asm/system_misc.h>
> > +
> > +#include "cpu-properties.h"
> > +
> > +#if defined(DEBUG)
> > +static const int debug = 1;
> > +#else
> > +static const int debug;
> > +#endif
>
> I don't think we need this.
I put the debug output into another patch, which I'll
decide to post or not later.
> > +
> > +typedef struct dtb_buffer {char b[0]; } dtb_t;
>
> It would be nice for this to be consistent with other dtb uses; if we
> need a dtb type then it shouldn't be specific to kexec.
This was to avoid errors due to the lack of type checking with
void* types. I've reworked this in the latest patch.
> > +static struct kexec_ctx *current_ctx;
> > +
> > +static int kexec_ctx_alloc(struct kimage *image)
> > +{
> > + BUG_ON(image->arch.ctx);
> > +
> > + image->arch.ctx = kmalloc(sizeof(struct kexec_ctx), GFP_KERNEL);
> > +
> > + if (!image->arch.ctx)
> > + return -ENOMEM;
> > +
> > + current_ctx = (struct kexec_ctx *)image->arch.ctx;
>
> This seems to be the only use of current_ctx. I take it this is a
> leftover from debugging?
>
> [...]
>
> > +/**
> > + * kexec_list_walk - Helper to walk the kimage page list.
> > + */
>
> Please keep this associated with the function it refers to (nothing
> should be between this comment and the function prototype).
>
> > +
> > +#define IND_FLAGS (IND_DESTINATION | IND_INDIRECTION | IND_DONE | IND_SOURCE)
>
> Can't this live in include/linux/kexec.h, where these flags are defined.
I have a kexec patch submitted to clean this up. I'll re-factor
this when that patch is upstream.
https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-August/120368.html
> The meaning of these doesn't seem to be documented anywhere. Would you
> be able to explain what each of these means?
I think lack of comments/documentation is a general weakness of the
core kexec code.
> > +static void kexec_list_walk(void *ctx, unsigned long kimage_head,
> > + void (*cb)(void *ctx, unsigned int flag, void *addr, void *dest))
> > +{
> > + void *dest;
> > + unsigned long *entry;
> > +
> > + for (entry = &kimage_head, dest = NULL; ; entry++) {
> > + unsigned int flag = *entry & IND_FLAGS;
> > + void *addr = phys_to_virt(*entry & PAGE_MASK);
> > +
> > + switch (flag) {
> > + case IND_INDIRECTION:
> > + entry = (unsigned long *)addr - 1;
> > + cb(ctx, flag, addr, NULL);
> > + break;
> > + case IND_DESTINATION:
> > + dest = addr;
> > + cb(ctx, flag, addr, NULL);
> > + break;
> > + case IND_SOURCE:
> > + cb(ctx, flag, addr, dest);
> > + dest += PAGE_SIZE;
>
> I really don't understand what's going on with dest here, but that's
> probably because I don't understand the meaning of the flags.
IND_SOURCE means the entry is a page of the current segment. dest is
the address of that page. When you have a new source page the
destination is post incremented. Think foo(src, dest++).
> > + break;
> > + case IND_DONE:
> > + cb(ctx, flag , NULL, NULL);
> > + return;
> > + default:
> > + pr_devel("%s:%d unknown flag %xh\n", __func__, __LINE__,
> > + flag);
>
> Wouldn't pr_warn would be more appropriate here?
We don't really don't need a message since the IND_ flags are well
established. I'll remove this.
>
> > + cb(ctx, flag, addr, NULL);
> > + break;
> > + }
> > + }
> > +}
> > +
> > +/**
> > + * kexec_image_info - For debugging output.
> > + */
> > +
> > +#define kexec_image_info(_i) _kexec_image_info(__func__, __LINE__, _i)
> > +static void _kexec_image_info(const char *func, int line,
> > + const struct kimage *image)
> > +{
> > + if (debug) {
> > + unsigned long i;
> > +
> > + pr_devel("%s:%d:\n", func, line);
> > + pr_devel(" kexec image info:\n");
> > + pr_devel(" type: %d\n", image->type);
> > + pr_devel(" start: %lx\n", image->start);
> > + pr_devel(" head: %lx\n", image->head);
> > + pr_devel(" nr_segments: %lu\n", image->nr_segments);
> > +
> > + for (i = 0; i < image->nr_segments; i++) {
> > + pr_devel(" segment[%lu]: %016lx - %016lx, "
> > + "%lxh bytes, %lu pages\n",
> > + i,
> > + image->segment[i].mem,
> > + image->segment[i].mem + image->segment[i].memsz,
> > + image->segment[i].memsz,
> > + image->segment[i].memsz / PAGE_SIZE);
> > +
> > + if (kexec_is_dtb_user(image->segment[i].buf))
> > + pr_devel(" dtb segment\n");
> > + }
> > + }
> > +}
>
> pr_devel is already dependent on DEBUG, so surely we don't need to check
> the debug variable?
I'm not sure how much of this would be removed as dead code. If
the compiler is cleaver enough it all should be.
> > +
> > +/**
> > + * kexec_find_dtb_seg - Helper routine to find the dtb segment.
> > + */
> > +
> > +static const struct kexec_segment *kexec_find_dtb_seg(
> > + const struct kimage *image)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < image->nr_segments; i++) {
> > + if (kexec_is_dtb_user(image->segment[i].buf))
> > + return &image->segment[i];
> > + }
> > +
> > + return NULL;
> > +}
>
> I'm really not keen on having the kernel guess which blobs need special
> treatment, though we seem to do that for arm.
Yes, to pass the dtb in r0 when th new kernel is entered.
> It would be far nicer if we could pass flags for each segment to
> describe what it is (e.g. kernel image, dtb, other binary blob),
Well, we do pass a flag of sorts, the DT magic value.
> so we
> can do things like pass multiple DTBs (so we load two kernels at once
> and pass each a unique DTB if we want to boot a new kernel + crashkernel
> pair). Unfortunately that would require some fairly invasive rework of
> the kexec core.
I don't think I'll attempt that any time soon. Feel free to
give it a try.
> For secureboot we can't trust a dtb from userspace, and will have to use
> kexec_file_load. To work with that we can either:
>
> * Reuse the original DTB, patched with the new command line. This may
> have statefulness issues (for things like simplefb).
>
> * Build a new DTB by flattening the current live tree. This would rely
> on drivers that modify state to patch the tree appropriately.
I have not yet looked into how to do this yet.
> [...]
>
> > +/**
> > + * kexec_cpu_info_init - Initialize an array of kexec_cpu_info structures.
> > + *
> > + * Allocates a cpu info array and fills it with info for all cpus found in
> > + * the device tree passed.
> > + */
> > +
> > +static int kexec_cpu_info_init(const struct device_node *dn,
> > + struct kexec_boot_info *info)
> > +{
> > + int result;
> > + unsigned int cpu;
> > +
> > + info->cp = kmalloc(
> > + info->cpu_count * sizeof(*info->cp), GFP_KERNEL);
> > +
> > + if (!info->cp) {
> > + pr_err("%s: Error: Out of memory.", __func__);
> > + return -ENOMEM;
> > + }
> > +
> > + for (cpu = 0; cpu < info->cpu_count; cpu++) {
> > + struct cpu_properties *cp = &info->cp[cpu];
> > +
> > + dn = of_find_node_by_type((struct device_node *)dn, "cpu");
> > +
> > + if (!dn) {
> > + pr_devel("%s:%d: bad node\n", __func__, __LINE__);
> > + goto on_error;
> > + }
> > +
> > + result = read_cpu_properties(cp, dn);
> > +
> > + if (result) {
> > + pr_devel("%s:%d: bad node\n", __func__, __LINE__);
> > + goto on_error;
> > + }
> > +
> > + if (cp->type == cpu_enable_method_psci)
> > + pr_devel("%s:%d: cpu-%u: hwid-%llx, '%s'\n",
> > + __func__, __LINE__, cpu, cp->hwid,
> > + cp->enable_method);
> > + else
> > + pr_devel("%s:%d: cpu-%u: hwid-%llx, '%s', "
> > + "cpu-release-addr %llx\n",
> > + __func__, __LINE__, cpu, cp->hwid,
> > + cp->enable_method,
> > + cp->cpu_release_addr);
> > + }
> > +
> > + return 0;
> > +
> > +on_error:
> > + kfree(info->cp);
> > + info->cp = NULL;
> > + return -EINVAL;
> > +}
>
> I don't see why we should need this at all. If we use the hotplug
> infrastructure, we don't need access to the enable-method and related
> properties, and the kexec code need only deal with a single CPU.
I removed all the checking in the latest patch.
> The only case where kexec needs to deal with other CPUs is when some are
> sat in the holding pen, but this code doesn't seem to handle that.
>
> as I believe I mentioned before, we should be able to extend the holding
> pen code to get those CPUs to increment a sat-in-pen counter and if
> that's non-zero after SMP bringup we print a warning (and disallow
> kexec).
I have some work-in-progress patches that try to do this, but I will not
include those in this series. See my spin-table branch:
https://git.linaro.org/people/geoff.levand/linux-kexec.git
> > +/**
> > +* kexec_compat_check - Iterator for kexec_cpu_check.
> > +*/
> > +
> > +static int kexec_compat_check(const struct kexec_ctx *ctx)
> > +{
> > + unsigned int cpu_1;
> > + unsigned int to_process;
> > +
> > + to_process = min(ctx->first.cpu_count, ctx->second.cpu_count);
> > +
> > + if (ctx->first.cpu_count != ctx->second.cpu_count)
> > + pr_warn("%s: Warning: CPU count mismatch %u != %u.\n",
> > + __func__, ctx->first.cpu_count, ctx->second.cpu_count);
> > +
> > + for (cpu_1 = 0; cpu_1 < ctx->first.cpu_count; cpu_1++) {
> > + unsigned int cpu_2;
> > + struct cpu_properties *cp_1 = &ctx->first.cp[cpu_1];
> > +
> > + for (cpu_2 = 0; cpu_2 < ctx->second.cpu_count; cpu_2++) {
> > + struct cpu_properties *cp_2 = &ctx->second.cp[cpu_2];
> > +
> > + if (cp_1->hwid != cp_2->hwid)
> > + continue;
> > +
> > + if (!kexec_cpu_check(cp_1, cp_2))
> > + return -EINVAL;
> > +
> > + to_process--;
> > + }
> > + }
> > +
> > + if (to_process) {
> > + pr_warn("%s: Warning: Failed to process %u CPUs.\n", __func__,
> > + to_process);
> > + return -EINVAL;
> > + }
> > +
> > + return 0;
> > +}
>
> I don't see the point in checking this in the kernel. If I pass the
> second stage kernel a new dtb where my enable methods are different,
> that was my choice as a user. If that doesn't work, that's my fault.
>
> There are plenty of other things that might be completely different that
> we don't sanity check, so I don't see why enable methods should be any
> different.
>
> [...]
>
> > +/**
> > + * kexec_check_cpu_die - Check if cpu_die() will work on secondary processors.
> > + */
> > +
> > +static int kexec_check_cpu_die(void)
> > +{
> > + unsigned int cpu;
> > + unsigned int sum = 0;
> > +
> > + /* For simplicity this also checks the primary CPU. */
> > +
> > + for_each_cpu(cpu, cpu_all_mask) {
> > + if (cpu && (!cpu_ops[cpu] || !cpu_ops[cpu]->cpu_disable ||
> > + cpu_ops[cpu]->cpu_disable(cpu))) {
> > + sum++;
> > + pr_err("%s: Error: "
> > + "CPU %u does not support hot un-plug.\n",
> > + __func__, cpu);
> > + }
> > + }
> > +
> > + return sum ? -EOPNOTSUPP : 0;
> > +}
>
> We should really use disable_nonboot_cpus() for this. That way we don't
> end up with a slightly different hotplug implementation for kexec. The
> above is missing cpu_kill calls, for example, and I'm worried by the
> possibility of further drift over time.
>
> I understand from our face-to-face discussion that you didn't want to
> require the PM infrastructure that disable_nonboot_cpus currently pulls
> in due to the being dependent on CONFIG_PM_SLEEP_SMP which selects
> CONFIG_PM_SLEEP and so on. The solution to that is to refactor the
> Kconfig so we can have disable_nonboot_cpus without all the other PM
> infrastructure.
I switch the current patch to use disable_nonboot_cpus().
> > +
> > +/**
> > + * machine_kexec_prepare - Prepare for a kexec reboot.
> > + *
> > + * Called from the core kexec code when a kernel image is loaded.
> > + */
> > +
> > +int machine_kexec_prepare(struct kimage *image)
> > +{
> > + int result;
>
> This seems to always be an error code. Call it 'err'.
>
> > + dtb_t *dtb = NULL;
> > + struct kexec_ctx *ctx;
> > + const struct kexec_segment *dtb_seg;
> > +
> > + kexec_image_info(image);
> > +
> > + result = kexec_check_cpu_die();
> > +
> > + if (result)
> > + goto on_error;
> > +
> > + result = kexec_ctx_alloc(image);
> > +
> > + if (result)
> > + goto on_error;
> > +
> > + ctx = kexec_image_to_ctx(image);
> > +
> > + result = kexec_boot_info_init(&ctx->first, NULL);
> > +
> > + if (result)
> > + goto on_error;
> > +
> > + dtb_seg = kexec_find_dtb_seg(image);
> > +
> > + if (!dtb_seg) {
> > + result = -EINVAL;
> > + goto on_error;
> > + }
> > +
> > + result = kexec_copy_dtb(dtb_seg, &dtb);
> > +
> > + if (result)
> > + goto on_error;
> > +
> > + result = kexec_boot_info_init(&ctx->second, dtb);
> > +
> > + if (result)
> > + goto on_error;
> > +
> > + result = kexec_compat_check(ctx);
> > +
> > + if (result)
> > + goto on_error;
> > +
> > + kexec_dtb_addr = dtb_seg->mem;
> > + kexec_kimage_start = image->start;
> > +
> > + goto on_exit;
> > +
> > +on_error:
> > + kexec_ctx_clean(image);
> > +on_exit:
>
> on_* looks weird, and doesn't match the style of other labels in
> arch/arm64. Could we call these 'out_clean' and 'out' instead?
>
> > + kfree(dtb);
> > + return result;
> > +}
> > +
> > +/**
> > + * kexec_list_flush_cb - Callback to flush the kimage list to PoC.
> > + */
> > +
> > +static void kexec_list_flush_cb(void *ctx , unsigned int flag,
> > + void *addr, void *dest)
> > +{
> > + switch (flag) {
> > + case IND_INDIRECTION:
> > + case IND_SOURCE:
> > + __flush_dcache_area(addr, PAGE_SIZE);
>
> Is PAGE_SIZE always big enough? Do we not have a more accurate size?
> Perhaps I've misunderstood what's going on here.
The image list is a list of pages, so PAGE_SIZE should be OK.
> > + break;
> > + default:
> > + break;
> > + }
> > +}
> > +
> > +/**
> > + * machine_kexec - Do the kexec reboot.
> > + *
> > + * Called from the core kexec code for a sys_reboot with LINUX_REBOOT_CMD_KEXEC.
> > + */
> > +
> > +void machine_kexec(struct kimage *image)
> > +{
> > + phys_addr_t reboot_code_buffer_phys;
> > + void *reboot_code_buffer;
> > + struct kexec_ctx *ctx = kexec_image_to_ctx(image);
> > +
> > + BUG_ON(relocate_new_kernel_size > KEXEC_CONTROL_PAGE_SIZE);
>
> It looks like relocate_new_kernel_size is a build-time constant. If we
> need that to be less than KEXEC_CONTROL_PAGE_SIZE, then we should make
> that a build-time check.
I moved this check into relocate_new_kernel with a
'.org KEXEC_CONTROL_PAGE_SIZE'.
> > + BUG_ON(num_online_cpus() > 1);
> > + BUG_ON(!ctx);
> > +
> > + kexec_image_info(image);
> > +
> > + kexec_kimage_head = image->head;
> > +
> > + reboot_code_buffer_phys = page_to_phys(image->control_code_page);
> > + reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);
> > +
> > + pr_devel("%s:%d: control_code_page: %p\n", __func__, __LINE__,
> > + (void *)image->control_code_page);
>
> This is already a pointer. Is the cast to void necessary?
>
> > + pr_devel("%s:%d: reboot_code_buffer_phys: %p\n", __func__, __LINE__,
> > + (void *)reboot_code_buffer_phys);
>
> Use %pa and pass &reboot_code_buffer_phys, no cast necessary.
>
> > + pr_devel("%s:%d: reboot_code_buffer: %p\n", __func__, __LINE__,
> > + reboot_code_buffer);
> > + pr_devel("%s:%d: relocate_new_kernel: %p\n", __func__, __LINE__,
> > + relocate_new_kernel);
> > + pr_devel("%s:%d: relocate_new_kernel_size: %lxh(%lu) bytes\n", __func__,
> > + __LINE__, relocate_new_kernel_size, relocate_new_kernel_size);
>
> Please use an '0x' prefix rather than a 'h' suffix. Do we need in print
> in both hex and decimal?
>
> > +
> > + pr_devel("%s:%d: kexec_dtb_addr: %p\n", __func__, __LINE__,
> > + (void *)kexec_dtb_addr);
> > + pr_devel("%s:%d: kexec_kimage_head: %p\n", __func__, __LINE__,
> > + (void *)kexec_kimage_head);
> > + pr_devel("%s:%d: kexec_kimage_start: %p\n", __func__, __LINE__,
> > + (void *)kexec_kimage_start);
>
> These are all unsigned long, so why not use the existing mechanism for
> printing unsigned long?
>
> > +
> > + /*
> > + * Copy relocate_new_kernel to the reboot_code_buffer for use
> > + * after the kernel is shut down.
> > + */
> > +
> > + memcpy(reboot_code_buffer, relocate_new_kernel,
> > + relocate_new_kernel_size);
> > +
> > + /* Assure reboot_code_buffer is copied. */
> > +
> > + mb();
>
> I don't think we need the mb if this is only to guarantee completion
> before the cache flush -- cacheable memory accesses should hazard
> against cache flushes by VA.
OK.
> > +
> > + pr_info("Bye!\n");
> > +
> > + local_disable(DAIF_ALL);
>
> We can move these two right before the soft_restart, after the cache
> maintenance. That way the print is closer to the exit of the current
> kernel.
OK.
> > +
> > + /* Flush the reboot_code_buffer in preparation for its execution. */
> > +
> > + __flush_dcache_area(reboot_code_buffer, relocate_new_kernel_size);
> > +
> > + /* Flush the kimage list. */
> > +
> > + kexec_list_walk(NULL, image->head, kexec_list_flush_cb);
> > +
> > + soft_restart(reboot_code_buffer_phys);
> > +}
> > +
> > +void machine_crash_shutdown(struct pt_regs *regs)
> > +{
> > + /* Empty routine needed to avoid build errors. */
> > +}
> > diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
> > new file mode 100644
> > index 0000000..92aba9d
> > --- /dev/null
> > +++ b/arch/arm64/kernel/relocate_kernel.S
> > @@ -0,0 +1,185 @@
> > +/*
> > + * kexec for arm64
> > + *
> > + * Copyright (C) Linaro.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#include <asm/assembler.h>
> > +#include <asm/memory.h>
> > +#include <asm/page.h>
> > +
> > +/* The list entry flags. */
> > +
> > +#define IND_DESTINATION_BIT 0
> > +#define IND_INDIRECTION_BIT 1
> > +#define IND_DONE_BIT 2
> > +#define IND_SOURCE_BIT 3
>
> Given these ned to match the existing IND_* flags in
> include/linux/kexec.h, and they aren't in any way specific to arm64,
> please put these ina an asm-generic header and redefine the existing
> IND_* flags in terms of them.
See my patch that does that:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-August/120368.html
>
> > +
> > +/*
> > + * relocate_new_kernel - Put the 2nd stage kernel image in place and boot it.
> > + *
> > + * The memory that the old kernel occupies may be overwritten when coping the
> > + * new kernel to its final location. To assure that the relocate_new_kernel
> > + * routine which does that copy is not overwritten all code and data needed
> > + * by relocate_new_kernel must be between the symbols relocate_new_kernel and
> > + * relocate_new_kernel_end. The machine_kexec() routine will copy
> > + * relocate_new_kernel to the kexec control_code_page, a special page which
> > + * has been set up to be preserved during the kernel copy operation.
> > + */
> > +
> > +.align 3
>
> Surely this isn't necessary?
No, the code should be properly aligned.
> > +
> > +.globl relocate_new_kernel
> > +relocate_new_kernel:
> > +
> > + /* Setup the list loop variables. */
> > +
> > + ldr x10, kexec_kimage_head /* x10 = list entry */
>
> Any reason for using x10 rather than starting with x0? Or x18, if you
> need to preserve the low registers?
>
> > +
> > + mrs x0, ctr_el0
> > + ubfm x0, x0, #16, #19
> > + mov x11, #4
> > + lsl x11, x11, x0 /* x11 = dcache line size */
>
> Any reason we can't use dcache_line_size, given it's a macro?
>
> > +
> > + mov x12, xzr /* x12 = segment start */
> > + mov x13, xzr /* x13 = entry ptr */
> > + mov x14, xzr /* x14 = copy dest */
> > +
> > + /* Check if the new kernel needs relocation. */
> > +
> > + cbz x10, .Ldone
> > + tbnz x10, IND_DONE_BIT, .Ldone
> > +
> > +.Lloop:
>
> Is there any reason for the '.L' on all of these? We only seem to do
> that in the lib code that was imported from elsewhere, and it doesn't
> match the rest of the arm64 asm.
.L is a local label prefix in gas. I don't think it would be good to have
these with larger scope.
> > + and x15, x10, PAGE_MASK /* x15 = addr */
> > +
> > + /* Test the entry flags. */
> > +
> > +.Ltest_source:
> > + tbz x10, IND_SOURCE_BIT, .Ltest_indirection
> > +
> > + /* copy_page(x20 = dest, x21 = src) */
> > +
> > + mov x20, x14
> > + mov x21, x15
> > +
> > +1: ldp x22, x23, [x21]
> > + ldp x24, x25, [x21, #16]
> > + ldp x26, x27, [x21, #32]
> > + ldp x28, x29, [x21, #48]
> > + add x21, x21, #64
> > + stnp x22, x23, [x20]
> > + stnp x24, x25, [x20, #16]
> > + stnp x26, x27, [x20, #32]
> > + stnp x28, x29, [x20, #48]
> > + add x20, x20, #64
> > + tst x21, #(PAGE_SIZE - 1)
> > + b.ne 1b
>
> It's a shame we can't reuse copy_page directly. Could we not move the
> body to a macro we can reuse here?
copy_page() also does some memory pre-fetch, which Arun said caused
problems on the APM board. If that board were available to me for
testing I could investigate, but at this time I will put this suggestion
on my todo list.
> > +
> > + /* dest += PAGE_SIZE */
> > +
> > + add x14, x14, PAGE_SIZE
> > + b .Lnext
> > +
> > +.Ltest_indirection:
> > + tbz x10, IND_INDIRECTION_BIT, .Ltest_destination
> > +
> > + /* ptr = addr */
> > +
> > + mov x13, x15
> > + b .Lnext
> > +
> > +.Ltest_destination:
> > + tbz x10, IND_DESTINATION_BIT, .Lnext
> > +
> > + /* flush segment */
> > +
> > + bl .Lflush
> > + mov x12, x15
> > +
> > + /* dest = addr */
> > +
> > + mov x14, x15
> > +
> > +.Lnext:
> > + /* entry = *ptr++ */
> > +
> > + ldr x10, [x13]
> > + add x13, x13, 8
>
> This can be:
>
> ldr x10, [x13], #8
>
> > +
> > + /* while (!(entry & DONE)) */
> > +
> > + tbz x10, IND_DONE_BIT, .Lloop
> > +
> > +.Ldone:
> > + /* flush last segment */
> > +
> > + bl .Lflush
> > +
> > + dsb sy
> > + isb
> > + ic ialluis
>
> This doesn't look right; we need a dsb and an isb after the instruction
> cache maintenance (or the icache could still be flushing when we branch
> to the new kernel).
OK.
> > +
> > + /* start_new_kernel */
> > +
> > + ldr x4, kexec_kimage_start
> > + ldr x0, kexec_dtb_addr
> > + mov x1, xzr
> > + mov x2, xzr
> > + mov x3, xzr
> > + br x4
> > +
> > +/* flush - x11 = line size, x12 = start addr, x14 = end addr. */
> > +
> > +.Lflush:
> > + cbz x12, 2f
> > + mov x0, x12
> > + sub x1, x11, #1
> > + bic x0, x0, x1
> > +1: dc civac, x0
> > + add x0, x0, x11
> > + cmp x0, x14
> > + b.lo 1b
> > +2: ret
>
> It would be nice if this were earlier in the file, before its callers.
Then we would need to jump over it, which I don't think is
very clean.
>
> > +
> > +.align 3
>
> We should have a comment as to why this is needed (to keep the 64-bit
> values below naturally aligned).
I haven't seen such an .align directive comment in any arm64 code yet.
-Geoff
More information about the kexec
mailing list