[PATCH v3 1/2] kexec: arm64: create identity page table to be used in purgatory
Pratyush Anand
panand at redhat.com
Fri Jun 2 07:29:09 PDT 2017
Hi James,
Thanks for taking out time to review the code.
On Friday 02 June 2017 01:54 PM, James Morse wrote:
> Hi Pratyush,
>
> On 23/05/17 06:02, Pratyush Anand wrote:
>> Purgatory sha verification is very slow when D-cache is not enabled there.
>> We need to enable MMU as well to enable D-Cache.Therefore,we need to
>> have an identity mapped page table in purgatory.
>>
>> Since debugging is very difficult in purgatory therefore we prefer to do as
>> much work as possible in kexec.
>>
>> This patch prepares page table for purgatory in advance. We support only 4K
>> page table,because it will be available on most of the platform. This page
>> table is passed to the purgatory as a new segment.
>>
>> VA bit is fixed as 48, page table level is 3 where 3rd level page table
>> contains 2M block entries.
>
> This had me confused: You actually have 4 levels of page tables, but the last
> level is never used, because the third level always contains block entries.
>
> If you want to use the kernel's macros for shifts and masks (which I think you
> should), you must use the same terminology.
>
> Because you wrote this:
>> +#define PGTABLE_LEVELS 3
>
> The calculated PGDIR_SHIFT is what the kernel would call PUD_SHIFT with 48bit VA.
>
>> +#define ARM64_HW_PGTABLE_LEVEL_SHIFT(n) ((PAGE_SHIFT - 3) * (4 - (n)) + 3)
>> +#define PGDIR_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - PGTABLE_LEVELS)
>
> Which then explains why you have to invent a new name for PGD_SHIFT, and
> calculate it manually:
>> +#define EXTRA_SHIFT (PGDIR_SHIFT + PAGE_SHIFT - 3)
>
>
> kexec-tools is closely coupled to the kernel, can we try to make this as similar
> as possible so that code (and reviewers!) can easily move between the two.
OK..I will try to sort out the differences
...
>
> (some other comments below)
>
> Thanks,
>
>
> James
>
>> diff --git a/kexec/arch/arm64/kexec-arm64.c b/kexec/arch/arm64/kexec-arm64.c
>> index 62f37585b788..af5dab331e97 100644
>> --- a/kexec/arch/arm64/kexec-arm64.c
>> +++ b/kexec/arch/arm64/kexec-arm64.c
>> @@ -33,6 +34,10 @@
>> #define PROP_ELFCOREHDR "linux,elfcorehdr"
>> #define PROP_USABLE_MEM_RANGE "linux,usable-memory-range"
>>
>> +typedef unsigned long guest_addr_t;
>> +typedef unsigned long host_addr_t;
>
> What do guest and host mean here?
guest is something which will be in purgatory domain and host is something in
kexec domain. I remember, you had suggested to differentiate these two address
domain for better code readability.
> I assumed everything that happens in purgatory would have something like
> phys_addr_t (for both physical and virtual addresses).
Yes, so guest_addr_t is like phys_addr_t. I can replace, if there are some
better names to differentiate between these two address domain.
>
>
>> +static int next_tbl_cnt = 1;
>> +
>> /* Global varables the core kexec routines expect. */
>
> Nit: variables
OK.
>
>
>>
>> unsigned char reuse_initrd;
>> @@ -519,6 +524,118 @@ unsigned long arm64_locate_kernel_segment(struct kexec_info *info)
>> return hole;
>> }
>>
>> +static host_addr_t *create_table_entry(host_addr_t *pgtbl_buf,
>> + guest_addr_t pgtbl_mem, host_addr_t *tbl,
>> + guest_addr_t virt, int shift,
>> + unsigned long ptrs)
>> +{
>> + unsigned long index, desc, offset;
>> +
>> + index = (virt >> shift) & (ptrs - 1);
>> + /* check if we have allocated a table already for this index */
>> + if (tbl[index]) {
>> + /*
>> + * table index will have entry as per purgatory (guest)page
>> + * table memory. Find out corresponding buffer address of
>> + * table (host).
>> + */
>> + desc = tbl[index] & ~3UL;
>> + offset = desc - pgtbl_mem;
>> + return &pgtbl_buf[offset >> 3];
>> + }
>> +
>> + if (next_tbl_cnt >= MAX_PGTBLE_SZ / PAGE_SIZE)
>> + die("%s:No more memory for page table\n", __func__);
>> + /*
>> + * Always write page table content as per page table memory
>> + * allocated for purgatory(guest) area, but return corresponding
>> + * buffer area allocated in kexec (host).
>> + */
>> +
>> + tbl[index] = (pgtbl_mem + PAGE_SIZE * next_tbl_cnt) | PMD_TYPE_TABLE;
>> +
>> + return &pgtbl_buf[(next_tbl_cnt++ * PAGE_SIZE) >> 3];
>> +}
>> +
>> +static void create_block_entry(host_addr_t *tbl, unsigned long flags,
>> + guest_addr_t virt)
>> +{
>> + unsigned long index;
>> + unsigned long desc;
>> +
>> + index = pmd_index(virt);
>> + desc = virt & PMD_MASK;
>> + desc |= flags;
>> + tbl[index] = desc;
>> +}
>> +
>> +static void create_identity_entry(host_addr_t *pgtbl_buf,
>> + guest_addr_t pgtbl_mem, guest_addr_t virt,
>> + unsigned long flags)
>> +{
>> + host_addr_t *tbl = pgtbl_buf;
>> + tbl = create_table_entry(pgtbl_buf, pgtbl_mem, tbl, virt,
>> + EXTRA_SHIFT, EXTRA_PTRS);
>> + tbl = create_table_entry(pgtbl_buf, pgtbl_mem, tbl, virt,
>> + IDENTITY_SHIFT, PTRS_PER_PTE);
>
> Are 'extra' and 'identity' names for levels in the page tables? If so can we use
> the same names as the kernel.
>
OK,will make it similar as per 4 level page (so 3 level swapper page) in kernel.
>
>> + create_block_entry(tbl, flags, virt);
>> +}
>> +
>> +/**
>> + * arm64_create_pgtbl_segment - Create page table segments to be used by
>> + * purgatory. Page table will have entries to access memory area of all
>> + * those segments which becomes part of sha verification in purgatory.
>> + * Additionally, we also create page table for purgatory segment as well.
>> + */
>> +
>> +static int arm64_create_pgtbl_segment(struct kexec_info *info,
>> + unsigned long hole_min, unsigned long hole_max)
>> +{
>> + host_addr_t *pgtbl_buf;
>> + guest_addr_t pgtbl_mem, mstart, mend;
>> + int i;
>> + unsigned long purgatory_base, purgatory_len;
>> +
>> + pgtbl_buf = xmalloc(MAX_PGTBLE_SZ);
>> + memset(pgtbl_buf, 0, MAX_PGTBLE_SZ);
>> + pgtbl_mem = add_buffer_phys_virt(info, pgtbl_buf, MAX_PGTBLE_SZ,
>> + MAX_PGTBLE_SZ, PAGE_SIZE, hole_min, hole_max, 1, 0);
>> + for (i = 0; i < info->nr_segments; i++) {
>> + if (info->segment[i].mem == (void *)info->rhdr.rel_addr) {
>> + purgatory_base = (unsigned long)info->segment[i].mem;
>> + purgatory_len = info->segment[i].memsz;
>> + }
>> + mstart = (unsigned long)info->segment[i].mem;
>> + mend = mstart + info->segment[i].memsz;
>> + mstart &= ~(SECTION_SIZE - 1);
>> + while (mstart < mend) {
>> + create_identity_entry(pgtbl_buf, pgtbl_mem,
>> + mstart, MMU_FLAGS_NORMAL);
>> + mstart += SECTION_SIZE;
>> + }
>> + }
>> +
>> + /* we will need pgtble_base in purgatory for enabling d-cache */
>> + elf_rel_set_symbol(&info->rhdr, "pgtble_base", &pgtbl_mem,
>> + sizeof(pgtbl_mem));
>> + /*
>> + * We need to disable d-cache before we exit from purgatory.
>> + * Since, only d-cache flush by VAs is recommended, therefore we
>> + * will also need memory location of all those area which will be
>> + * accessed in purgatory with enabled d-cache. sha256_regions
>> + * already have start and length for all the segments except
>> + * purgatory. Therefore, we will need to pass start and length of
>> + * purgatory additionally.n
>> + */
>> + elf_rel_set_symbol(&info->rhdr, "purgatory_base", &purgatory_base,
>> + sizeof(purgatory_base));
>> + elf_rel_set_symbol(&info->rhdr, "purgatory_len", &purgatory_len,
>> + sizeof(purgatory_len));
>> +
>> + return 0;
>> +}
>> +
>> /**
>> * arm64_load_other_segments - Prepare the dtb, initrd and purgatory segments.
>> */
>> @@ -630,6 +747,8 @@ int arm64_load_other_segments(struct kexec_info *info,
>> elf_rel_set_symbol(&info->rhdr, "arm64_dtb_addr", &dtb_base,
>> sizeof(dtb_base));
>>
>> + arm64_create_pgtbl_segment(info, hole_min, hole_max);
>
> hole_min->hole_max is the region from the end of the Kernel image, to the end of
> usable memory. I can't see how this window is made smaller as the dtb, initramfs
> then page tables are added to it... (kvmtool updates the equivalent values as it
> adds things)
>
> Does kexec-tools have some other accounting for what memory is in use? (in which
> case why are these values needed?)
>
Please see locate_hole() and add_segment_phys_virt().
locate_hole() finds a hole from info->memory_range lying between hole_min and
hole_max. add_segment_phys_virt() assigned allocated hole to info->segment.
locate_hole() uses info->segment as well to know about consumed regions.
>
>> return 0;
>> }
>>
>> diff --git a/kexec/arch/arm64/kexec-mmu.h b/kexec/arch/arm64/kexec-mmu.h
>> new file mode 100644
>> index 000000000000..55354b5e3002
>> --- /dev/null
>> +++ b/kexec/arch/arm64/kexec-mmu.h
>> @@ -0,0 +1,58 @@
>> +#if !defined(KEXEC_MMU_H)
>> +#define KEXEC_MMU_H
>
> As lots of the symbols names and values below were copied from the kernel, we
> should copy the copyright statement too. Something like:
>
> /*
> * Based on linux v4.11's arch/arm64/include/asm/pgtable-hwdef.h
> * Copyright (C) 2012 ARM Ltd.
> */
>
OK
>
>> +/*
>> + * kexec creates identity page table to be used in purgatory so that
>
>> + * dcache verification becomes faster.
>
> ... so that SHA verification of the image can make use of the dcache.
OK.. :-)
>
>
>> + *
>> + * These are the definitions to be used by page table creation routine.
>> + *
>> + * Only 4K page table, 3 level, 2M block mapping, 48bit VA is supported
>> + */
>> +#define PAGE_SHIFT 12
>> +#define PGTABLE_LEVELS 3
>> +#define ARM64_HW_PGTABLE_LEVEL_SHIFT(n) ((PAGE_SHIFT - 3) * (4 - (n)) + 3)
>> +#define PGDIR_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - PGTABLE_LEVELS)
>
>> +#define EXTRA_SHIFT (PGDIR_SHIFT + PAGE_SHIFT - 3)
>> +#define EXTRA_PTRS (1 << (48 - EXTRA_SHIFT))
>
> This looks like the level above PGD? What do you need this for if you have only
> 3 levels of page tables?
As said above, will remove these names now.
>
>
>> +#define PUD_SHIFT PGDIR_SHIFT
>> +#define PMD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(2)
>> +#define PMD_SIZE (1UL << PMD_SHIFT)
>> +#define PMD_MASK (~(PMD_SIZE-1))
>> +#define IDENTITY_SHIFT PUD_SHIFT
>> +#define PTRS_PER_PTE (1 << (PAGE_SHIFT - 3))
>> +#define PTRS_PER_PMD PTRS_PER_PTE
>> +#define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
>> +#define PMD_TYPE_TABLE (3UL << 0)
>> +#define PMD_TYPE_SECT (1UL << 0)
>> +#define PMD_SECT_AF (1UL << 10)
>> +#define PMD_ATTRINDX(t) ((unsigned long)(t) << 2)
>> +#define MT_NORMAL 0
>> +#define PMD_FLAGS_NORMAL (PMD_TYPE_SECT | PMD_SECT_AF)
>> +#define MMU_FLAGS_NORMAL (PMD_ATTRINDX(MT_NORMAL) | PMD_FLAGS_NORMAL)
>> +#define SECTION_SHIFT PMD_SHIFT
>> +#define SECTION_SIZE (1UL << SECTION_SHIFT)
>> +#define PAGE_SIZE (1 << PAGE_SHIFT)
>> +/*
>> + * Since we are using 3 level of page tables, therefore minimum number of
>> + * table will be 3. Each entry in level 3 page table can map 2MB memory
>> + * area. Thus a level 3 page table indexed by bit 29:21 can map a total of
>> + * 1G memory area. Therefore, if any segment crosses 1G boundary, then we
>> + * will need one more level 3 table. Similarly, level 2 page table indexed
>> + * by bit 38:30 can map a total of 512G memory area. If segment addresses
>> + * are more than 512G apart then we will need two more table for each such
>> + * block.
>> + * Lets consider 2G as the maximum size of crashkernel.
>
> crashkernel+initramfs ?
OK
>
> A good assumption. Can we test this is true when the image is loaded?
> (otherwise its going to be really fun to debug!)
>
That we already do. If we cross that boundary we will need additional space
and it will die() during kexec load.
if (next_tbl_cnt >= MAX_PGTBLE_SZ / PAGE_SIZE)
die("%s:No more memory for page table\n", __func__);
[It is difficult to implement, get going without MMU enabled in that case]
>
>> This 2G memory
>> + * location might not be allocated at 1G aligned boundary, so in that case
>> + * we need to have 5 table size resrved to map any location of the crash
>
> (Nit: reserved)
OK.
>
>
>> + * kernel region.
>> + *
>> + * If we will ever wish to support uart debugging in purgatory then that
>> + * might cross the boundary and therefore additional 2 more table space. In
>> + * that case, we will need a total of 7 table space.
>> + *
>> + * As of now keep it fixed at 5. Increase it if any platform either
>> + * supports uart or more than 2G of crash kernel size.
>> + */
>> +#define MAX_PGTBLE_SZ (5 * PAGE_SIZE)
>> +
>> +#endif
>>
~Pratyush
More information about the kexec
mailing list