[PATCH 11/13] arm64/kexec: Add core kexec support

Wed Sep 24 17:25:04 PDT 2014

Hi Mark,

On Thu, 2014-09-18 at 02:13 +0100, Mark Rutland wrote:
> On Tue, Sep 09, 2014 at 11:49:05PM +0100, Geoff Levand wrote:

> > +++ b/arch/arm64/include/asm/kexec.h
> > @@ -0,0 +1,52 @@
> > +/*
> > + * kexec for arm64
> > + *
> > + * Copyright (C) Linaro.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#if !defined(_ARM64_KEXEC_H)
> > +#define _ARM64_KEXEC_H
> > +
> > +/* Maximum physical address we can use pages from */
> > +
> > +#define KEXEC_SOURCE_MEMORY_LIMIT (-1UL)
> > +
> > +/* Maximum address we can reach in physical address mode */
> > +
> > +#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
> > +
> > +/* Maximum address we can use for the control code buffer */
> > +
> > +#define KEXEC_CONTROL_MEMORY_LIMIT (-1UL)
> > +
> 
> What are these used for? I see that other architectures seem to do the
> same thing, but they look odd.

They need to be defined for the core kexec code, but arm64
doesn't use them.

> > +#define KEXEC_CONTROL_PAGE_SIZE        4096
> 
> What's this used for?

This is the size reserved for the reboot_code_buffer defined in
kexec's core code.  For arm64, we copy our relocate_new_kernel
routine into the reboot_code_buffer.

> Does this work with 64k pages, and is there any reason we can't figure
> out the actual size of the code (so we don't get bitten if it grows)?

Kexec will reserve pages to satisfy KEXEC_CONTROL_PAGE_SIZE, so for
all arm64 page configs one page will be reserved for this value (4096).
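
A rough user-space sketch of that arithmetic, assuming the core code rounds
the requested size up to a power-of-two number of pages.  order_for_size()
and the page shifts are stand-ins made up for illustration, not the kernel's
own helper:

```c
#define KEXEC_CONTROL_PAGE_SIZE 4096UL

/*
 * Hypothetical stand-in for an order calculation: the smallest order such
 * that (1 << order) pages of (1 << page_shift) bytes cover 'size' bytes.
 */
static unsigned int order_for_size(unsigned long size, unsigned int page_shift)
{
	unsigned long page_size = 1UL << page_shift;
	unsigned long pages = (size + page_size - 1) / page_size;
	unsigned int order = 0;

	while ((1UL << order) < pages)
		order++;

	return order;
}
```

With a page shift of 12 (4k pages) or 16 (64k pages) the result for
KEXEC_CONTROL_PAGE_SIZE is order 0, i.e. a single page in both cases.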

The latest implementation has a build-time check that relocate_new_kernel
is not too big, a '.org KEXEC_CONTROL_PAGE_SIZE' directive.

> > +
> > +#define KEXEC_ARCH KEXEC_ARCH_ARM64
> > +
> > +#define ARCH_HAS_KIMAGE_ARCH
> > +
> > +#if !defined(__ASSEMBLY__)
> > +
> > +struct kimage_arch {
> > +       void *ctx;
> > +};
> > +
> > +/**
> > + * crash_setup_regs() - save registers for the panic kernel
> > + *
> > + * @newregs: registers are saved here
> > + * @oldregs: registers to be saved (may be %NULL)
> > + */
> > +
> > +static inline void crash_setup_regs(struct pt_regs *newregs,
> > +                                   struct pt_regs *oldregs)
> > +{
> > +}
> 
> It would be nice to know what we're going to do for this.
> 
> Is this a required function, or can we get away without crash kernel
> support for the moment?

This is just to avoid a build error.  It is not used for kexec re-boot.

> > +
> > +#endif /* !defined(__ASSEMBLY__) */
> > +
> > +#endif
> > diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> > index df7ef87..8b7c029 100644
> > --- a/arch/arm64/kernel/Makefile
> > +++ b/arch/arm64/kernel/Makefile
> > @@ -29,6 +29,8 @@ arm64-obj-$(CONFIG_ARM64_CPU_SUSPEND) += sleep.o suspend.o
> >  arm64-obj-$(CONFIG_JUMP_LABEL)         += jump_label.o
> >  arm64-obj-$(CONFIG_KGDB)               += kgdb.o
> >  arm64-obj-$(CONFIG_EFI)                        += efi.o efi-stub.o efi-entry.o
> > +arm64-obj-$(CONFIG_KEXEC)              += machine_kexec.o relocate_kernel.o    \
> > +                                          cpu-properties.o
> >
> >  obj-y                                  += $(arm64-obj-y) vdso/
> >  obj-m                                  += $(arm64-obj-m)
> > diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
> > new file mode 100644
> > index 0000000..043a3bc
> > --- /dev/null
> > +++ b/arch/arm64/kernel/machine_kexec.c
> > @@ -0,0 +1,612 @@
> > +/*
> > + * kexec for arm64
> > + *
> > + * Copyright (C) Linaro.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/kexec.h>
> > +#include <linux/of_fdt.h>
> > +#include <linux/slab.h>
> > +#include <linux/uaccess.h>
> > +
> > +#include <asm/cacheflush.h>
> > +#include <asm/cpu_ops.h>
> > +#include <asm/system_misc.h>
> > +
> > +#include "cpu-properties.h"
> > +
> > +#if defined(DEBUG)
> > +static const int debug = 1;
> > +#else
> > +static const int debug;
> > +#endif
> 
> I don't think we need this.

I put the debug output into another patch, which I may or may not
post later.

> > +
> > +typedef struct dtb_buffer {char b[0]; } dtb_t;
> 
> It would be nice for this to be consistent with other dtb uses; if we
> need a dtb type then it shouldn't be specific to kexec.

This was to avoid errors due to the lack of type checking with
void* types.  I've reworked this in the latest patch.

> > +static struct kexec_ctx *current_ctx;
> > +
> > +static int kexec_ctx_alloc(struct kimage *image)
> > +{
> > +       BUG_ON(image->arch.ctx);
> > +
> > +       image->arch.ctx = kmalloc(sizeof(struct kexec_ctx), GFP_KERNEL);
> > +
> > +       if (!image->arch.ctx)
> > +               return -ENOMEM;
> > +
> > +       current_ctx = (struct kexec_ctx *)image->arch.ctx;
> 
> This seems to be the only use of current_ctx. I take it this is a
> leftover from debugging?
> 
> [...]
> 
> > +/**
> > + * kexec_list_walk - Helper to walk the kimage page list.
> > + */
> 
> Please keep this associated with the function it refers to (nothing
> should be between this comment and the function prototype).
> 
> > +
> > +#define IND_FLAGS (IND_DESTINATION | IND_INDIRECTION | IND_DONE | IND_SOURCE)
> 
> Can't this live in include/linux/kexec.h, where these flags are defined.

I have a kexec patch submitted to clean this up.  I'll re-factor
this when that patch is upstream.

  https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-August/120368.html

> The meaning of these doesn't seem to be documented anywhere. Would you
> be able to explain what each of these means?

I think lack of comments/documentation is a general weakness of the
core kexec code.

> > +static void kexec_list_walk(void *ctx, unsigned long kimage_head,
> > +       void (*cb)(void *ctx, unsigned int flag, void *addr, void *dest))
> > +{
> > +       void *dest;
> > +       unsigned long *entry;
> > +
> > +       for (entry = &kimage_head, dest = NULL; ; entry++) {
> > +               unsigned int flag = *entry & IND_FLAGS;
> > +               void *addr = phys_to_virt(*entry & PAGE_MASK);
> > +
> > +               switch (flag) {
> > +               case IND_INDIRECTION:
> > +                       entry = (unsigned long *)addr - 1;
> > +                       cb(ctx, flag, addr, NULL);
> > +                       break;
> > +               case IND_DESTINATION:
> > +                       dest = addr;
> > +                       cb(ctx, flag, addr, NULL);
> > +                       break;
> > +               case IND_SOURCE:
> > +                       cb(ctx, flag, addr, dest);
> > +                       dest += PAGE_SIZE;
> 
> I really don't understand what's going on with dest here, but that's
> probably because I don't understand the meaning of the flags.

IND_SOURCE means the entry is a source page of the current segment.  dest
is the address that page gets copied to.  After each source page the
destination is post-incremented.  Think foo(src, dest++).
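
To make the dest handling concrete, here is a simplified user-space model of
the walk.  It is only a sketch: it handles a flat list (indirection pages are
not modelled), the IND_* values follow the bit numbers used in
relocate_kernel.S in this patch, and walk_flat_list(), struct copy_op and the
addresses in example_list are invented for illustration:

```c
#include <stddef.h>

#define PAGE_SIZE        4096UL
#define PAGE_MASK        (~(PAGE_SIZE - 1))

#define IND_DESTINATION  0x1UL
#define IND_INDIRECTION  0x2UL
#define IND_DONE         0x4UL
#define IND_SOURCE       0x8UL
#define IND_FLAGS (IND_DESTINATION | IND_INDIRECTION | IND_DONE | IND_SOURCE)

struct copy_op {
	unsigned long src;
	unsigned long dest;
};

/*
 * Walk a flat entry list and record each (src, dest) pair.
 * Returns the number of pairs written to ops[].
 */
static size_t walk_flat_list(const unsigned long *entry,
			     struct copy_op *ops, size_t max_ops)
{
	unsigned long dest = 0;
	size_t n = 0;

	for (;; entry++) {
		unsigned long addr = *entry & PAGE_MASK;

		switch (*entry & IND_FLAGS) {
		case IND_DESTINATION:
			dest = addr;		/* start of a new segment */
			break;
		case IND_SOURCE:
			if (n == max_ops)
				return n;
			ops[n].src = addr;
			ops[n].dest = dest;
			n++;
			dest += PAGE_SIZE;	/* foo(src, dest++) */
			break;
		case IND_DONE:
			return n;
		default:
			break;			/* indirection not modelled */
		}
	}
}

/* Two segments: two pages to 0x80000, then one page to 0x200000. */
static const unsigned long example_list[] = {
	0x80000UL  | IND_DESTINATION,
	0x10000UL  | IND_SOURCE,
	0x35000UL  | IND_SOURCE,
	0x200000UL | IND_DESTINATION,
	0x42000UL  | IND_SOURCE,
	IND_DONE,
};
```

Walking example_list yields three copies; the second source page of the first
segment lands at 0x81000, one page after the segment's destination.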

> > +                       break;
> > +               case IND_DONE:
> > +                       cb(ctx, flag , NULL, NULL);
> > +                       return;
> > +               default:
> > +                       pr_devel("%s:%d unknown flag %xh\n", __func__, __LINE__,
> > +                               flag);
> 
> Wouldn't pr_warn be more appropriate here?

We don't really need a message since the IND_ flags are well
established.  I'll remove this.

> 
> > +                       cb(ctx, flag, addr, NULL);
> > +                       break;
> > +               }
> > +       }
> > +}
> > +
> > +/**
> > + * kexec_image_info - For debugging output.
> > + */
> > +
> > +#define kexec_image_info(_i) _kexec_image_info(__func__, __LINE__, _i)
> > +static void _kexec_image_info(const char *func, int line,
> > +       const struct kimage *image)
> > +{
> > +       if (debug) {
> > +               unsigned long i;
> > +
> > +               pr_devel("%s:%d:\n", func, line);
> > +               pr_devel("  kexec image info:\n");
> > +               pr_devel("    type:        %d\n", image->type);
> > +               pr_devel("    start:       %lx\n", image->start);
> > +               pr_devel("    head:        %lx\n", image->head);
> > +               pr_devel("    nr_segments: %lu\n", image->nr_segments);
> > +
> > +               for (i = 0; i < image->nr_segments; i++) {
> > +                       pr_devel("      segment[%lu]: %016lx - %016lx, "
> > +                               "%lxh bytes, %lu pages\n",
> > +                               i,
> > +                               image->segment[i].mem,
> > +                               image->segment[i].mem + image->segment[i].memsz,
> > +                               image->segment[i].memsz,
> > +                               image->segment[i].memsz /  PAGE_SIZE);
> > +
> > +                       if (kexec_is_dtb_user(image->segment[i].buf))
> > +                               pr_devel("        dtb segment\n");
> > +               }
> > +       }
> > +}
> 
> pr_devel is already dependent on DEBUG, so surely we don't need to check
> the debug variable?

I'm not sure how much of this would be removed as dead code.  If
the compiler is clever enough it all should be.
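
A stripped-down illustration of the pattern in question, as a user-space
sketch.  dump_segments() and its return value are made up so the effect can
be observed; whether the guarded body is actually dropped from the object
code depends on the compiler and optimisation level:

```c
#include <stdio.h>

#if defined(DEBUG)
static const int debug = 1;
#else
static const int debug;		/* zero-initialised: debugging off */
#endif

/* Returns the number of lines printed, so the effect is observable. */
static int dump_segments(const unsigned long *memsz, unsigned int nr)
{
	int printed = 0;

	if (debug) {	/* constant condition: the body is dead when debug == 0 */
		unsigned int i;

		for (i = 0; i < nr; i++) {
			printf("segment[%u]: %lx bytes\n", i, memsz[i]);
			printed++;
		}
	}

	return printed;
}
```

Built without -DDEBUG the function prints nothing and returns 0; an
optimising compiler can then see the condition is constant and discard the
loop.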

> > +
> > +/**
> > + * kexec_find_dtb_seg - Helper routine to find the dtb segment.
> > + */
> > +
> > +static const struct kexec_segment *kexec_find_dtb_seg(
> > +       const struct kimage *image)
> > +{
> > +       int i;
> > +
> > +       for (i = 0; i < image->nr_segments; i++) {
> > +               if (kexec_is_dtb_user(image->segment[i].buf))
> > +                       return &image->segment[i];
> > +       }
> > +
> > +       return NULL;
> > +}
> 
> I'm really not keen on having the kernel guess which blobs need special
> treatment, though we seem to do that for arm.

Yes, to pass the dtb in r0 when the new kernel is entered.

> It would be far nicer if we could pass flags for each segment to
> describe what it is (e.g. kernel image, dtb, other binary blob), 

Well, we do pass a flag of sorts, the DT magic value.
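
For reference, a test on the DT magic value amounts to something like the
sketch below.  OF_DT_HEADER (0xd00dfeed, stored big-endian at the start of a
flattened device tree) is the real constant; looks_like_dtb() is a
hypothetical user-space helper, not the kexec_is_dtb_user() routine from the
patch:

```c
#include <stddef.h>
#include <stdint.h>

#define OF_DT_HEADER 0xd00dfeedU	/* FDT magic, stored big-endian */

/* Hypothetical helper: nonzero if buf starts with the FDT magic. */
static int looks_like_dtb(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t magic;

	if (!buf || len < 4)
		return 0;

	/* Assemble the big-endian word by hand so host endianness is moot. */
	magic = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
		((uint32_t)p[2] << 8) | (uint32_t)p[3];

	return magic == OF_DT_HEADER;
}
```

Any segment whose first four bytes are d0 0d fe ed is treated as the dtb;
everything else is just another blob.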

> so we
> can do things like pass multiple DTBs (so we load two kernels at once
> and pass each a unique DTB if we want to boot a new kernel + crashkernel
> pair). Unfortunately that would require some fairly invasive rework of
> the kexec core.

I don't think I'll attempt that any time soon.  Feel free to
give it a try.

> For secureboot we can't trust a dtb from userspace, and will have to use
> kexec_file_load. To work with that we can either:
> 
> * Reuse the original DTB, patched with the new command line. This may
>   have statefulness issues (for things like simplefb).
> 
> * Build a new DTB by flattening the current live tree. This would rely
>   on drivers that modify state to patch the tree appropriately.

I have not yet looked into how to do this.

> [...]
> 
> > +/**
> > + * kexec_cpu_info_init - Initialize an array of kexec_cpu_info structures.
> > + *
> > + * Allocates a cpu info array and fills it with info for all cpus found in
> > + * the device tree passed.
> > + */
> > +
> > +static int kexec_cpu_info_init(const struct device_node *dn,
> > +       struct kexec_boot_info *info)
> > +{
> > +       int result;
> > +       unsigned int cpu;
> > +
> > +       info->cp = kmalloc(
> > +               info->cpu_count * sizeof(*info->cp), GFP_KERNEL);
> > +
> > +       if (!info->cp) {
> > +               pr_err("%s: Error: Out of memory.", __func__);
> > +               return -ENOMEM;
> > +       }
> > +
> > +       for (cpu = 0; cpu < info->cpu_count; cpu++) {
> > +               struct cpu_properties *cp = &info->cp[cpu];
> > +
> > +               dn = of_find_node_by_type((struct device_node *)dn, "cpu");
> > +
> > +               if (!dn) {
> > +                       pr_devel("%s:%d: bad node\n", __func__, __LINE__);
> > +                       goto on_error;
> > +               }
> > +
> > +               result = read_cpu_properties(cp, dn);
> > +
> > +               if (result) {
> > +                       pr_devel("%s:%d: bad node\n", __func__, __LINE__);
> > +                       goto on_error;
> > +               }
> > +
> > +               if (cp->type == cpu_enable_method_psci)
> > +                       pr_devel("%s:%d: cpu-%u: hwid-%llx, '%s'\n",
> > +                               __func__, __LINE__, cpu, cp->hwid,
> > +                               cp->enable_method);
> > +               else
> > +                       pr_devel("%s:%d: cpu-%u: hwid-%llx, '%s', "
> > +                               "cpu-release-addr %llx\n",
> > +                               __func__, __LINE__, cpu, cp->hwid,
> > +                               cp->enable_method,
> > +                               cp->cpu_release_addr);
> > +       }
> > +
> > +       return 0;
> > +
> > +on_error:
> > +       kfree(info->cp);
> > +       info->cp = NULL;
> > +       return -EINVAL;
> > +}
> 
> I don't see why we should need this at all. If we use the hotplug
> infrastructure, we don't need access to the enable-method and related
> properties, and the kexec code need only deal with a single CPU.

I removed all the checking in the latest patch.

> The only case where kexec needs to deal with other CPUs is when some are
> sat in the holding pen, but this code doesn't seem to handle that.
> 
> As I believe I mentioned before, we should be able to extend the holding
> pen code to get those CPUs to increment a sat-in-pen counter and if
> that's non-zero after SMP bringup we print a warning (and disallow
> kexec).

I have some work-in-progress patches that try to do this, but I will not
include those in this series.  See my spin-table branch:

  https://git.linaro.org/people/geoff.levand/linux-kexec.git

> > +/**
> > +* kexec_compat_check - Iterator for kexec_cpu_check.
> > +*/
> > +
> > +static int kexec_compat_check(const struct kexec_ctx *ctx)
> > +{
> > +       unsigned int cpu_1;
> > +       unsigned int to_process;
> > +
> > +       to_process = min(ctx->first.cpu_count, ctx->second.cpu_count);
> > +
> > +       if (ctx->first.cpu_count != ctx->second.cpu_count)
> > +               pr_warn("%s: Warning: CPU count mismatch %u != %u.\n",
> > +                       __func__, ctx->first.cpu_count, ctx->second.cpu_count);
> > +
> > +       for (cpu_1 = 0; cpu_1 < ctx->first.cpu_count; cpu_1++) {
> > +               unsigned int cpu_2;
> > +               struct cpu_properties *cp_1 = &ctx->first.cp[cpu_1];
> > +
> > +               for (cpu_2 = 0; cpu_2 < ctx->second.cpu_count; cpu_2++) {
> > +                       struct cpu_properties *cp_2 = &ctx->second.cp[cpu_2];
> > +
> > +                       if (cp_1->hwid != cp_2->hwid)
> > +                               continue;
> > +
> > +                       if (!kexec_cpu_check(cp_1, cp_2))
> > +                               return -EINVAL;
> > +
> > +                       to_process--;
> > +               }
> > +       }
> > +
> > +       if (to_process) {
> > +               pr_warn("%s: Warning: Failed to process %u CPUs.\n", __func__,
> > +                       to_process);
> > +               return -EINVAL;
> > +       }
> > +
> > +       return 0;
> > +}
> 
> I don't see the point in checking this in the kernel. If I pass the
> second stage kernel a new dtb where my enable methods are different,
> that was my choice as a user. If that doesn't work, that's my fault.
> 
> There are plenty of other things that might be completely different that
> we don't sanity check, so I don't see why enable methods should be any
> different.
> 
> [...]
> 
> > +/**
> > + * kexec_check_cpu_die -  Check if cpu_die() will work on secondary processors.
> > + */
> > +
> > +static int kexec_check_cpu_die(void)
> > +{
> > +       unsigned int cpu;
> > +       unsigned int sum = 0;
> > +
> > +       /* For simplicity this also checks the primary CPU. */
> > +
> > +       for_each_cpu(cpu, cpu_all_mask) {
> > +               if (cpu && (!cpu_ops[cpu] || !cpu_ops[cpu]->cpu_disable ||
> > +                       cpu_ops[cpu]->cpu_disable(cpu))) {
> > +                       sum++;
> > +                       pr_err("%s: Error: "
> > +                               "CPU %u does not support hot un-plug.\n",
> > +                               __func__, cpu);
> > +               }
> > +       }
> > +
> > +       return sum ? -EOPNOTSUPP : 0;
> > +}
> 
> We should really use disable_nonboot_cpus() for this. That way we don't
> end up with a slightly different hotplug implementation for kexec. The
> above is missing cpu_kill calls, for example, and I'm worried by the
> possibility of further drift over time.
> 
> I understand from our face-to-face discussion that you didn't want to
> require the PM infrastructure that disable_nonboot_cpus currently pulls
> in due to it being dependent on CONFIG_PM_SLEEP_SMP which selects
> CONFIG_PM_SLEEP and so on. The solution to that is to refactor the
> Kconfig so we can have disable_nonboot_cpus without all the other PM
> infrastructure.

I switched the current patch to use disable_nonboot_cpus().

> > +
> > +/**
> > + * machine_kexec_prepare - Prepare for a kexec reboot.
> > + *
> > + * Called from the core kexec code when a kernel image is loaded.
> > + */
> > +
> > +int machine_kexec_prepare(struct kimage *image)
> > +{
> > +       int result;
> 
> This seems to always be an error code. Call it 'err'.
> 
> > +       dtb_t *dtb = NULL;
> > +       struct kexec_ctx *ctx;
> > +       const struct kexec_segment *dtb_seg;
> > +
> > +       kexec_image_info(image);
> > +
> > +       result = kexec_check_cpu_die();
> > +
> > +       if (result)
> > +               goto on_error;
> > +
> > +       result = kexec_ctx_alloc(image);
> > +
> > +       if (result)
> > +               goto on_error;
> > +
> > +       ctx = kexec_image_to_ctx(image);
> > +
> > +       result = kexec_boot_info_init(&ctx->first, NULL);
> > +
> > +       if (result)
> > +               goto on_error;
> > +
> > +       dtb_seg = kexec_find_dtb_seg(image);
> > +
> > +       if (!dtb_seg) {
> > +               result = -EINVAL;
> > +               goto on_error;
> > +       }
> > +
> > +       result = kexec_copy_dtb(dtb_seg, &dtb);
> > +
> > +       if (result)
> > +               goto on_error;
> > +
> > +       result = kexec_boot_info_init(&ctx->second, dtb);
> > +
> > +       if (result)
> > +               goto on_error;
> > +
> > +       result = kexec_compat_check(ctx);
> > +
> > +       if (result)
> > +               goto on_error;
> > +
> > +       kexec_dtb_addr = dtb_seg->mem;
> > +       kexec_kimage_start = image->start;
> > +
> > +       goto on_exit;
> > +
> > +on_error:
> > +       kexec_ctx_clean(image);
> > +on_exit:
> 
> on_* looks weird, and doesn't match the style of other labels in
> arch/arm64. Could we call these 'out_clean' and 'out' instead?
> 
> > +       kfree(dtb);
> > +       return result;
> > +}
> > +
> > +/**
> > + * kexec_list_flush_cb - Callback to flush the kimage list to PoC.
> > + */
> > +
> > +static void kexec_list_flush_cb(void *ctx , unsigned int flag,
> > +       void *addr, void *dest)
> > +{
> > +       switch (flag) {
> > +       case IND_INDIRECTION:
> > +       case IND_SOURCE:
> > +               __flush_dcache_area(addr, PAGE_SIZE);
> 
> Is PAGE_SIZE always big enough? Do we not have a more accurate size?
> Perhaps I've misunderstood what's going on here.

The image list is a list of pages, so PAGE_SIZE should be OK.

> > +               break;
> > +       default:
> > +               break;
> > +       }
> > +}
> > +
> > +/**
> > + * machine_kexec - Do the kexec reboot.
> > + *
> > + * Called from the core kexec code for a sys_reboot with LINUX_REBOOT_CMD_KEXEC.
> > + */
> > +
> > +void machine_kexec(struct kimage *image)
> > +{
> > +       phys_addr_t reboot_code_buffer_phys;
> > +       void *reboot_code_buffer;
> > +       struct kexec_ctx *ctx = kexec_image_to_ctx(image);
> > +
> > +       BUG_ON(relocate_new_kernel_size > KEXEC_CONTROL_PAGE_SIZE);
> 
> It looks like relocate_new_kernel_size is a build-time constant. If we
> need that to be less than KEXEC_CONTROL_PAGE_SIZE, then we should make
> that a build-time check.

I moved this check into relocate_new_kernel with a
'.org KEXEC_CONTROL_PAGE_SIZE'.
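
The assembler refuses to move the location counter backwards, so the .org
fails the build if the code has already grown past that offset.  For
comparison only, the same kind of build-time size check written in C would
look roughly like this; reloc_code and its 128-byte size are placeholders
invented for the example:

```c
#define KEXEC_CONTROL_PAGE_SIZE 4096

/* Placeholder for the relocation code; the size is illustrative only. */
static const unsigned char reloc_code[128] = { 0 };

/* Fails the build, rather than hitting BUG_ON() at run time, if it grows. */
_Static_assert(sizeof(reloc_code) <= KEXEC_CONTROL_PAGE_SIZE,
	       "relocation code does not fit in the control page");

static unsigned long reloc_code_size(void)
{
	return sizeof(reloc_code);
}
```

Either way the size problem is caught when the kernel is built, not when
someone tries to kexec.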

> > +       BUG_ON(num_online_cpus() > 1);
> > +       BUG_ON(!ctx);
> > +
> > +       kexec_image_info(image);
> > +
> > +       kexec_kimage_head = image->head;
> > +
> > +       reboot_code_buffer_phys = page_to_phys(image->control_code_page);
> > +       reboot_code_buffer = phys_to_virt(reboot_code_buffer_phys);
> > +
> > +       pr_devel("%s:%d: control_code_page:        %p\n", __func__, __LINE__,
> > +               (void *)image->control_code_page);
> 
> This is already a pointer. Is the cast to void necessary?
> 
> > +       pr_devel("%s:%d: reboot_code_buffer_phys:  %p\n", __func__, __LINE__,
> > +               (void *)reboot_code_buffer_phys);
> 
> Use %pa and pass &reboot_code_buffer_phys, no cast necessary.
> 
> > +       pr_devel("%s:%d: reboot_code_buffer:       %p\n", __func__, __LINE__,
> > +               reboot_code_buffer);
> > +       pr_devel("%s:%d: relocate_new_kernel:      %p\n", __func__, __LINE__,
> > +               relocate_new_kernel);
> > +       pr_devel("%s:%d: relocate_new_kernel_size: %lxh(%lu) bytes\n", __func__,
> > +               __LINE__, relocate_new_kernel_size, relocate_new_kernel_size);
> 
> Please use a '0x' prefix rather than an 'h' suffix. Do we need to print
> in both hex and decimal?
> 
> > +
> > +       pr_devel("%s:%d: kexec_dtb_addr:           %p\n", __func__, __LINE__,
> > +               (void *)kexec_dtb_addr);
> > +       pr_devel("%s:%d: kexec_kimage_head:        %p\n", __func__, __LINE__,
> > +               (void *)kexec_kimage_head);
> > +       pr_devel("%s:%d: kexec_kimage_start:       %p\n", __func__, __LINE__,
> > +               (void *)kexec_kimage_start);
> 
> These are all unsigned long, so why not use the existing mechanism for
> printing unsigned long?
> 
> > +
> > +       /*
> > +        * Copy relocate_new_kernel to the reboot_code_buffer for use
> > +        * after the kernel is shut down.
> > +        */
> > +
> > +       memcpy(reboot_code_buffer, relocate_new_kernel,
> > +               relocate_new_kernel_size);
> > +
> > +       /* Assure reboot_code_buffer is copied. */
> > +
> > +       mb();
> 
> I don't think we need the mb if this is only to guarantee completion
> before the cache flush -- cacheable memory accesses should hazard
> against cache flushes by VA.

OK.

> > +
> > +       pr_info("Bye!\n");
> > +
> > +       local_disable(DAIF_ALL);
> 
> We can move these two right before the soft_restart, after the cache
> maintenance. That way the print is closer to the exit of the current
> kernel.

OK.

> > +
> > +       /* Flush the reboot_code_buffer in preparation for its execution. */
> > +
> > +       __flush_dcache_area(reboot_code_buffer, relocate_new_kernel_size);
> > +
> > +       /* Flush the kimage list. */
> > +
> > +       kexec_list_walk(NULL, image->head, kexec_list_flush_cb);
> > +
> > +       soft_restart(reboot_code_buffer_phys);
> > +}
> > +
> > +void machine_crash_shutdown(struct pt_regs *regs)
> > +{
> > +       /* Empty routine needed to avoid build errors. */
> > +}
> > diff --git a/arch/arm64/kernel/relocate_kernel.S b/arch/arm64/kernel/relocate_kernel.S
> > new file mode 100644
> > index 0000000..92aba9d
> > --- /dev/null
> > +++ b/arch/arm64/kernel/relocate_kernel.S
> > @@ -0,0 +1,185 @@
> > +/*
> > + * kexec for arm64
> > + *
> > + * Copyright (C) Linaro.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#include <asm/assembler.h>
> > +#include <asm/memory.h>
> > +#include <asm/page.h>
> > +
> > +/* The list entry flags. */
> > +
> > +#define IND_DESTINATION_BIT 0
> > +#define IND_INDIRECTION_BIT 1
> > +#define IND_DONE_BIT        2
> > +#define IND_SOURCE_BIT      3
> 
> Given these need to match the existing IND_* flags in
> include/linux/kexec.h, and they aren't in any way specific to arm64,
> please put these in an asm-generic header and redefine the existing
> IND_* flags in terms of them.

See my patch that does that:

  https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-August/120368.html
> 
> > +
> > +/*
> > + * relocate_new_kernel - Put the 2nd stage kernel image in place and boot it.
> > + *
> > + * The memory that the old kernel occupies may be overwritten when copying the
> > + * new kernel to its final location.  To assure that the relocate_new_kernel
> > + * routine which does that copy is not overwritten all code and data needed
> > + * by relocate_new_kernel must be between the symbols relocate_new_kernel and
> > + * relocate_new_kernel_end.  The machine_kexec() routine will copy
> > + * relocate_new_kernel to the kexec control_code_page, a special page which
> > + * has been set up to be preserved during the kernel copy operation.
> > + */
> > +
> > +.align 3
> 
> Surely this isn't necessary?

No, the code should be properly aligned.

> > +
> > +.globl relocate_new_kernel
> > +relocate_new_kernel:
> > +
> > +       /* Setup the list loop variables. */
> > +
> > +       ldr     x10, kexec_kimage_head          /* x10 = list entry */
> 
> Any reason for using x10 rather than starting with x0? Or x18, if you
> need to preserve the low registers?
> 
> > +
> > +       mrs     x0, ctr_el0
> > +       ubfm    x0, x0, #16, #19
> > +       mov     x11, #4
> > +       lsl     x11, x11, x0                    /* x11 = dcache line size */
> 
> Any reason we can't use dcache_line_size, given it's a macro?
> 
> > +
> > +       mov     x12, xzr                        /* x12 = segment start */
> > +       mov     x13, xzr                        /* x13 = entry ptr */
> > +       mov     x14, xzr                        /* x14 = copy dest */
> > +
> > +       /* Check if the new kernel needs relocation. */
> > +
> > +       cbz     x10, .Ldone
> > +       tbnz    x10, IND_DONE_BIT, .Ldone
> > +
> > +.Lloop:
> 
> Is there any reason for the '.L' on all of these? We only seem to do
> that in the lib code that was imported from elsewhere, and it doesn't
> match the rest of the arm64 asm.

.L is a local label prefix in gas.  I don't think it would be good to have
these with larger scope.

> > +       and     x15, x10, PAGE_MASK             /* x15 = addr */
> > +
> > +       /* Test the entry flags. */
> > +
> > +.Ltest_source:
> > +       tbz     x10, IND_SOURCE_BIT, .Ltest_indirection
> > +
> > +       /* copy_page(x20 = dest, x21 = src) */
> > +
> > +       mov x20, x14
> > +       mov x21, x15
> > +
> > +1:     ldp     x22, x23, [x21]
> > +       ldp     x24, x25, [x21, #16]
> > +       ldp     x26, x27, [x21, #32]
> > +       ldp     x28, x29, [x21, #48]
> > +       add     x21, x21, #64
> > +       stnp    x22, x23, [x20]
> > +       stnp    x24, x25, [x20, #16]
> > +       stnp    x26, x27, [x20, #32]
> > +       stnp    x28, x29, [x20, #48]
> > +       add     x20, x20, #64
> > +       tst     x21, #(PAGE_SIZE - 1)
> > +       b.ne    1b
> 
> It's a shame we can't reuse copy_page directly. Could we not move the
> body to a macro we can reuse here?

copy_page() also does some memory pre-fetch, which Arun said caused
problems on the APM board.  If that board were available to me for
testing I could investigate, but at this time I will put this suggestion
on my todo list.

> > +
> > +       /* dest += PAGE_SIZE */
> > +
> > +       add     x14, x14, PAGE_SIZE
> > +       b       .Lnext
> > +
> > +.Ltest_indirection:
> > +       tbz     x10, IND_INDIRECTION_BIT, .Ltest_destination
> > +
> > +       /* ptr = addr */
> > +
> > +       mov     x13, x15
> > +       b       .Lnext
> > +
> > +.Ltest_destination:
> > +       tbz     x10, IND_DESTINATION_BIT, .Lnext
> > +
> > +       /* flush segment */
> > +
> > +       bl      .Lflush
> > +       mov     x12, x15
> > +
> > +       /* dest = addr */
> > +
> > +       mov     x14, x15
> > +
> > +.Lnext:
> > +       /* entry = *ptr++ */
> > +
> > +       ldr     x10, [x13]
> > +       add     x13, x13, 8
> 
> This can be:
> 
> 	ldr	x10, [x13], #8
> 
> > +
> > +       /* while (!(entry & DONE)) */
> > +
> > +       tbz     x10, IND_DONE_BIT, .Lloop
> > +
> > +.Ldone:
> > +       /* flush last segment */
> > +
> > +       bl      .Lflush
> > +
> > +       dsb     sy
> > +       isb
> > +       ic      ialluis
> 
> This doesn't look right; we need a dsb and an isb after the instruction
> cache maintenance (or the icache could still be flushing when we branch
> to the new kernel).

OK.

> > +
> > +       /* start_new_kernel */
> > +
> > +       ldr     x4, kexec_kimage_start
> > +       ldr     x0, kexec_dtb_addr
> > +       mov     x1, xzr
> > +       mov     x2, xzr
> > +       mov     x3, xzr
> > +       br      x4
> > +
> > +/* flush - x11 = line size, x12 = start addr, x14 = end addr. */
> > +
> > +.Lflush:
> > +       cbz     x12, 2f
> > +       mov     x0, x12
> > +       sub     x1, x11, #1
> > +       bic     x0, x0, x1
> > +1:     dc      civac, x0
> > +       add     x0, x0, x11
> > +       cmp     x0, x14
> > +       b.lo    1b
> > +2:     ret
> 
> It would be nice if this were earlier in the file, before its callers.

Then we would need to jump over it, which I don't think is
very clean.

> 
> > +
> > +.align 3
> 
> We should have a comment as to why this is needed (to keep the 64-bit
> values below naturally aligned).

I haven't seen such an .align directive comment in any arm64 code yet.

-Geoff