[RFC PATCH v3 01/20] x86/kexec: Ensure control_code_page is mapped in kexec page tables

David Woodhouse dwmw2 at infradead.org
Mon Nov 25 02:29:17 PST 2024


On Mon, 2024-11-25 at 09:54 +0000, David Woodhouse wrote:
> From: David Woodhouse <dwmw at amazon.co.uk>
> 
> The control_code_page should be explicitly mapped into the identity
> mapped page tables for the relocate_kernel environment. This only seems
> to have worked by luck before, because it tended to be within the same
> 2MiB or 1GiB large page already mapped for another reason.
> 
> A subsequent commit will reduce the control_code_page to a single 4KiB
> page instead of a higher-order allocation, and seems to make it much
> *less* likely that we get lucky with its placement. This leads to a
> fault when relocate_kernel() first tries to access the page through its
> identity-mapped virtual address.

This one is confusing me. Jan points out that it shouldn't be needed:
the control page should come from kernel memory, so it should be mapped
anyway by the loop immediately below my added code, which adds *all* of
the pfn_mapped[] ranges.
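
Jan's argument boils down to a containment check, which can be sketched
in plain userspace C as below. This is only an illustration: struct
pfn_range, example_map and range_is_covered() are hypothetical names
rather than kernel code, and the single example entry is taken from the
log output further down in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* 4KiB pages, as on x86-64 */

/* Hypothetical stand-in for one pfn_mapped[] entry: [start, end) in page frames. */
struct pfn_range {
	uint64_t start;
	uint64_t end;
};

/* Example data (assumption): the one range printed in the log, 0 - 7ffdd000. */
const struct pfn_range example_map[] = { { 0x0, 0x7ffdd } };

/* True if the physical range [addr, addr + len) lies entirely inside one entry. */
bool range_is_covered(const struct pfn_range *map, size_t n,
		      uint64_t addr, uint64_t len)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t mstart = map[i].start << PAGE_SHIFT;
		uint64_t mend = map[i].end << PAGE_SHIFT;

		if (addr >= mstart && addr + len <= mend)
			return true;
	}
	return false;
}
```

With that one entry, a 4KiB page at 2b32000 counts as covered, while a
page starting at 7ffdd000 does not.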

And from code inspection he appears to be right, but if I disable the
new mapping and add some printks...

--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -247,15 +247,18 @@ static int init_pgtable(struct kimage *image, unsigned long control_page)
                info.direct_gbpages = true;
 
        /* Ensure the control code page itself is in the direct map */
+       pr_info("No Map control page at %lx", control_page);
+#if 0
        result = kernel_ident_mapping_init(&info, image->arch.pgd, control_page,
                                           control_page + KEXEC_CONTROL_CODE_MAX_SIZE);
        if (result)
                return result;
-
+#endif
        for (i = 0; i < nr_pfn_mapped; i++) {
                mstart = pfn_mapped[i].start << PAGE_SHIFT;
                mend   = pfn_mapped[i].end << PAGE_SHIFT;
 
+               pr_info("Map pfn_mapped[%d] %lx - %lx\n", i, mstart, mend);
                result = kernel_ident_mapping_init(&info, image->arch.pgd,
                                                   mstart, mend);
                if (result)


... and run in a version of qemu which dumps the CPU state on triple
fault...

+ ./loadret
[    0.948097] kexec: No Map control page at 2b32000
[    0.948103] kexec: Map pfn_mapped[0] 0 - 7ffdd000
[    0.960192] Freezing user space processes
[    0.961685] Freezing user space processes completed (elapsed 0.001 seconds)
[    0.962372] OOM killer disabled.
[    1.088668] ata2: found unknown device (class 0)
[    1.095810] Disabling non-boot CPUs ...
[    1.117990] smpboot: CPU 1 is now offline
[    1.118595] crash hp: kexec_trylock() failed, kdump image may be inaccurate
RAX=0000000080050033 RBX=0000000000000000 RCX=0000000000000001 RDX=0000000000400000
RSI=0000000002b3205a RDI=0000000003a44002 RBP=ffff9709c2109400 RSP=0000000002b33000
R8 =0000000000000000 R9 =00000000038a0000 R10=0000000000000000 R11=0000000000000001
R12=0000000000000000 R13=0000000000170ef0 R14=00000000fee1dead R15=0000000000000000
RIP=ffff9709c2b32057 RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
GS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
LDT=0000 0000000000000000 00000000 00000000
TR =0040 fffffe2fb91b2000 00004087 00008b00 DPL=0 TSS64-busy
GDT=     0000000000000000 00000000
IDT=     0000000000000000 00000000
CR0=80050033 CR2=0000000002b32ff8 CR3=00000000038a0000 CR4=00170ef0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=04 00 00 49 89 cb 48 8d a6 00 10 00 00 48 81 c6 5a 00 00 00 <56> c3 cc 6a 00 52 48 8d 05 8c 04 00 00 50 66 ff 30 0f 01 14 24 48 83 c4 0a 8c d8 8e d8 48


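The faulting RIP (...057) is the pushq below, where relocate_kernel
first touches the 1:1 mapping of the control page. CR2=2b32ff8 is the
address that push writes to, just below the new stack top at 2b33000: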

        /* setup a new stack at the end of the physical control page */
        lea     PAGE_SIZE(%rsi), %rsp
  49:   48 8d a6 00 10 00 00    lea    0x1000(%rsi),%rsp

        /* jump to identity mapped page */
        addq    $(identity_mapped - relocate_kernel), %rsi
  50:   48 81 c6 5a 00 00 00    add    $0x5a,%rsi
        pushq   %rsi
  57:   56                      push   %rsi
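
As a sanity check on that reading, the arithmetic relating the control
page address to the register dump can be sketched as below. The helper
names are made up for illustration, and the model assumes nothing more
than the three instructions listed above.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 0x1000ULL /* 4KiB control page */

/* %rsp after "lea PAGE_SIZE(%rsi), %rsp": the end of the control page. */
uint64_t initial_rsp(uint64_t control_page)
{
	return control_page + PAGE_SIZE;
}

/* Address written by the first 8-byte push (the stack pre-decrements). */
uint64_t first_push_addr(uint64_t control_page)
{
	return initial_rsp(control_page) - 8;
}

/* %rsi after "addq $(identity_mapped - relocate_kernel), %rsi". */
uint64_t jump_target(uint64_t control_page, uint64_t offset)
{
	return control_page + offset;
}
```

For a control page at 2b32000 and an offset of 0x5a, these give
RSP=2b33000, a first push to 2b32ff8 and RSI=2b3205a, which matches the
RSP, CR2 and RSI values in the dump.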

The control page at 2b32xxx *really* ought to be mapped, as it's
clearly within the 0 - 7ffdd000 range. What's going on?
