[PATCH v2] makedumpfile: Exclude unnecessary hugepages.
Petr Tesarik
ptesarik at suse.cz
Mon Jun 16 23:34:47 PDT 2014
On Tue, 17 Jun 2014 02:32:51 +0000
Atsushi Kumagai <kumagai-atsushi at mxc.nes.nec.co.jp> wrote:
> Hello,
>
> This is the v2 patch for hugepage filtering rebased on Petr's
> "Generic multi-page exclusion", thanks Petr.
>
> The current kernel's VMCOREINFO doesn't include the values necessary
> for this patch, so we need "-x vmlinux" to enable hugepage filtering.
> I tested this patch on kernel 3.13.
>
> Regarding this, Petr made an effort to add the values to VMCOREINFO,
> but that work looks suspended:
>
> https://lkml.org/lkml/2014/4/11/349
Actually, I received an Acked-by from Vivek last Wednesday. Oh, wait a
moment, this email went to Andrew Morton, but not to any mailing
list. :-(
> So, we should resume that discussion for this patch. Then, if it's
> accepted, I'll modify this patch to use PG_head_mask.
I have already experimented with hugepage filtering, but haven't sent
my patches yet, precisely because they depend on a not-yet-confirmed
feature in the kernel.
Anyway, let's take your patch as the base. I'll add my comments where I
believe my approach was better or cleaner.
> Thanks
> Atsushi Kumagai
>
>
> From: Atsushi Kumagai <kumagai-atsushi at mxc.nes.nec.co.jp>
> Date: Tue, 17 Jun 2014 08:59:44 +0900
> Subject: [PATCH v2] Exclude unnecessary hugepages.
>
> There are two types of hugepages in the kernel, and both should be
> excluded as user pages.
>
> 1. Transparent huge pages (THP)
> All these pages are anonymous pages (at least for now), so we just
> need to know how many pages make up the corresponding hugepage.
> That count can be read from the page->lru.prev of the second page in
> the hugepage.
>
> 2. Hugetlbfs pages
> These pages aren't anonymous pages, but they are still a kind of user
> page, so we should exclude them as well.
> Luckily, it's possible to detect them by looking at the page->lru.next
> of the second page in the hugepage. This idea came from the kernel's
> PageHuge().
Good point! My patch didn't take care of hugetlbfs pages.
> The number of pages can be obtained in the same way as for THP.
>
> Signed-off-by: Atsushi Kumagai <kumagai-atsushi at mxc.nes.nec.co.jp>
> ---
> makedumpfile.c | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
> makedumpfile.h | 8 ++++
> 2 files changed, 120 insertions(+), 5 deletions(-)
>
> diff --git a/makedumpfile.c b/makedumpfile.c
> index 34db997..d170c54 100644
> --- a/makedumpfile.c
> +++ b/makedumpfile.c
> @@ -92,6 +92,7 @@ do { \
> } while (0)
>
> static void setup_page_is_buddy(void);
> +static void setup_page_is_hugepage(void);
>
> void
> initialize_tables(void)
> @@ -328,6 +329,18 @@ update_mmap_range(off_t offset, int initial) {
> }
>
> static int
> +page_is_hugepage(unsigned long flags) {
> + if (NUMBER(PG_head) != NOT_FOUND_NUMBER) {
> + return isHead(flags);
> + } else if (NUMBER(PG_tail) != NOT_FOUND_NUMBER) {
> + return isTail(flags);
> + } else if (NUMBER(PG_compound) != NOT_FOUND_NUMBER) {
> + return isCompound(flags);
> + }
> + return 0;
> +}
> +
> +static int
Since it looks like we'll get the mask in VMCOREINFO, I'd rather use
the mask and construct it from the PG_* flags if there's no VMCOREINFO.
I add a long PG_head_mask to struct number_table and define this macro:
#define isCompoundHead(flags) (!!((flags) & NUMBER(PG_head_mask)))
Then I initialize it in get_structure_info like this:
PG_head = get_enum_number("PG_head");
if (PG_head == FAILED_DWARFINFO) {
PG_head = get_enum_number("PG_compound");
if (PG_head == FAILED_DWARFINFO)
return FALSE;
}
NUMBER(PG_head_mask) = 1L << PG_head;
with a fallback in get_value_for_old_linux:
if (NUMBER(PG_head_mask) == NOT_FOUND_NUMBER)
NUMBER(PG_head_mask) = 1L << PG_compound_ORIGINAL;
Also, I prefer to write PG_head_mask to the makedumpfile-generated
VMCOREINFO, so it will have the same fields as the kernel-generated
VMCOREINFO.
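Concretely, that parity is just the matching write/read pair, roughly
like this (a sketch; it also assumes the PG_head_mask member has been
added to struct number_table as described above):

	/* write_vmcoreinfo_data(): same field name as the kernel uses */
	WRITE_NUMBER("PG_head_mask", PG_head_mask);

	/* read_vmcoreinfo(): pick the mask up again from VMCOREINFO */
	READ_NUMBER("PG_head_mask", PG_head_mask);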
> is_mapped_with_mmap(off_t offset) {
>
> if (info->flag_usemmap == MMAP_ENABLE
> @@ -1180,6 +1193,7 @@ get_symbol_info(void)
> SYMBOL_INIT(vmemmap_list, "vmemmap_list");
> SYMBOL_INIT(mmu_psize_defs, "mmu_psize_defs");
> SYMBOL_INIT(mmu_vmemmap_psize, "mmu_vmemmap_psize");
> + SYMBOL_INIT(free_huge_page, "free_huge_page");
>
> return TRUE;
> }
> @@ -1288,11 +1302,19 @@ get_structure_info(void)
>
> ENUM_NUMBER_INIT(PG_lru, "PG_lru");
> ENUM_NUMBER_INIT(PG_private, "PG_private");
> + ENUM_NUMBER_INIT(PG_head, "PG_head");
> + ENUM_NUMBER_INIT(PG_tail, "PG_tail");
> + ENUM_NUMBER_INIT(PG_compound, "PG_compound");
> ENUM_NUMBER_INIT(PG_swapcache, "PG_swapcache");
> ENUM_NUMBER_INIT(PG_buddy, "PG_buddy");
> ENUM_NUMBER_INIT(PG_slab, "PG_slab");
> ENUM_NUMBER_INIT(PG_hwpoison, "PG_hwpoison");
>
> + if (NUMBER(PG_head) == NOT_FOUND_NUMBER &&
> + NUMBER(PG_compound) == NOT_FOUND_NUMBER)
> + /* Pre-2.6.26 kernels did not have pageflags */
> + NUMBER(PG_compound) = PG_compound_ORIGINAL;
> +
> ENUM_TYPE_SIZE_INIT(pageflags, "pageflags");
>
> TYPEDEF_SIZE_INIT(nodemask_t, "nodemask_t");
> @@ -1694,6 +1716,7 @@ write_vmcoreinfo_data(void)
> WRITE_SYMBOL("vmemmap_list", vmemmap_list);
> WRITE_SYMBOL("mmu_psize_defs", mmu_psize_defs);
> WRITE_SYMBOL("mmu_vmemmap_psize", mmu_vmemmap_psize);
> + WRITE_SYMBOL("free_huge_page", free_huge_page);
>
> /*
> * write the structure size of 1st kernel
> @@ -1783,6 +1806,9 @@ write_vmcoreinfo_data(void)
>
> WRITE_NUMBER("PG_lru", PG_lru);
> WRITE_NUMBER("PG_private", PG_private);
> + WRITE_NUMBER("PG_head", PG_head);
> + WRITE_NUMBER("PG_tail", PG_tail);
> + WRITE_NUMBER("PG_compound", PG_compound);
> WRITE_NUMBER("PG_swapcache", PG_swapcache);
> WRITE_NUMBER("PG_buddy", PG_buddy);
> WRITE_NUMBER("PG_slab", PG_slab);
> @@ -2033,6 +2059,7 @@ read_vmcoreinfo(void)
> READ_SYMBOL("vmemmap_list", vmemmap_list);
> READ_SYMBOL("mmu_psize_defs", mmu_psize_defs);
> READ_SYMBOL("mmu_vmemmap_psize", mmu_vmemmap_psize);
> + READ_SYMBOL("free_huge_page", free_huge_page);
>
> READ_STRUCTURE_SIZE("page", page);
> READ_STRUCTURE_SIZE("mem_section", mem_section);
> @@ -2109,6 +2136,9 @@ read_vmcoreinfo(void)
>
> READ_NUMBER("PG_lru", PG_lru);
> READ_NUMBER("PG_private", PG_private);
> + READ_NUMBER("PG_head", PG_head);
> + READ_NUMBER("PG_tail", PG_tail);
> + READ_NUMBER("PG_compound", PG_compound);
> READ_NUMBER("PG_swapcache", PG_swapcache);
> READ_NUMBER("PG_slab", PG_slab);
> READ_NUMBER("PG_buddy", PG_buddy);
> @@ -3283,6 +3313,9 @@ out:
> if (!get_value_for_old_linux())
> return FALSE;
>
> + /* Get page flags for compound pages */
> + setup_page_is_hugepage();
> +
> /* use buddy identification of free pages whether cyclic or not */
> /* (this can reduce pages scan of 1TB memory from 60sec to 30sec) */
> if (info->dump_level & DL_EXCLUDE_FREE)
> @@ -4346,6 +4379,24 @@ out:
> "follow free lists instead of mem_map array.\n");
> }
>
> +static void
> +setup_page_is_hugepage(void)
> +{
> + if (NUMBER(PG_head) != NOT_FOUND_NUMBER) {
> + if (NUMBER(PG_tail) == NOT_FOUND_NUMBER) {
> + /*
> + * If PG_tail is not explicitly saved, then assume
> + * that it immediately follows PG_head.
> + */
> + NUMBER(PG_tail) = NUMBER(PG_head) + 1;
> + }
> + } else if ((NUMBER(PG_compound) != NOT_FOUND_NUMBER)
> + && (info->dump_level & DL_EXCLUDE_USER_DATA)) {
> + MSG("Compound page bit could not be determined: ");
> + MSG("huge pages will NOT be filtered.\n");
> + }
> +}
> +
> /*
> * If using a dumpfile in kdump-compressed format as a source file
> * instead of /proc/vmcore, 1st-bitmap of a new dumpfile must be
> @@ -4660,8 +4711,9 @@ __exclude_unnecessary_pages(unsigned long mem_map,
> mdf_pfn_t pfn_read_start, pfn_read_end;
> unsigned char page_cache[SIZE(page) * PGMM_CACHED];
> unsigned char *pcache;
> - unsigned int _count, _mapcount = 0;
> + unsigned int _count, _mapcount = 0, compound_order = 0;
> unsigned long flags, mapping, private = 0;
> + unsigned long hugetlb_dtor;
>
> /*
> * If a multi-page exclusion is pending, do it first
> @@ -4727,6 +4779,27 @@ __exclude_unnecessary_pages(unsigned long mem_map,
> flags = ULONG(pcache + OFFSET(page.flags));
> _count = UINT(pcache + OFFSET(page._count));
> mapping = ULONG(pcache + OFFSET(page.mapping));
> +
> + if (index_pg < PGMM_CACHED - 1) {
> + compound_order = ULONG(pcache + SIZE(page) + OFFSET(page.lru)
> + + OFFSET(list_head.prev));
> + hugetlb_dtor = ULONG(pcache + SIZE(page) + OFFSET(page.lru)
> + + OFFSET(list_head.next));
> + } else if (pfn + 1 < pfn_end) {
AFAICS this clause is not needed. All compound pages are aligned to
their page order, e.g. the head page of an order-2 compound page is
aligned to a multiple of 4. Since the mem_map cache is aligned to
PGMM_CACHED, which is defined as 512 (a power of 2), the last PFN of
the cache always has an odd index, so it cannot possibly be the head
of a compound page.
I even added a sanity check for the alignment:
if (order && order < sizeof(unsigned long) * 8 &&
(pfn & ((1UL << order) - 1)) == 0)
Ok, the "order" above corresponds to your "compound_order"...
> + unsigned char page_cache_next[SIZE(page)];
> + if (!readmem(VADDR, mem_map, page_cache_next, SIZE(page))) {
> + ERRMSG("Can't read the buffer of struct page.\n");
> + return FALSE;
> + }
> + compound_order = ULONG(page_cache_next + OFFSET(page.lru)
> + + OFFSET(list_head.prev));
> + hugetlb_dtor = ULONG(page_cache_next + OFFSET(page.lru)
> + + OFFSET(list_head.next));
> + } else {
> + compound_order = 0;
> + hugetlb_dtor = 0;
> + }
> +
> if (OFFSET(page._mapcount) != NOT_FOUND_STRUCTURE)
> _mapcount = UINT(pcache + OFFSET(page._mapcount));
> if (OFFSET(page.private) != NOT_FOUND_STRUCTURE)
> @@ -4754,6 +4827,10 @@ __exclude_unnecessary_pages(unsigned long mem_map,
> && !isPrivate(flags) && !isAnon(mapping)) {
> if (clear_bit_on_2nd_bitmap_for_kernel(pfn, cycle))
> pfn_cache++;
> + /*
> + * NOTE: If THP for cache is introduced, the check for
> + * compound pages is needed here.
> + */
I do this differently. I added:
mdf_pfn_t *pfn_counter
Then I set pfn_counter to the appropriate counter, but do not call
clear_bit_on_2nd_bitmap_for_kernel(). At the end of the long
if-else-if-else-if statement I add a final else-clause:
/*
* Page not excluded
*/
else
continue;
If execution gets here, the page is excluded, so I can do:
if (nr_pages == 1) {
if (clear_bit_on_2nd_bitmap_for_kernel(pfn, cycle))
(*pfn_counter)++;
} else {
exclude_range(pfn_counter, pfn, pfn + nr_pages, cycle);
}
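The individual branches then only pick the counter; e.g. the cache-page
branch would become roughly this (a sketch):

	else if ((info->dump_level & DL_EXCLUDE_CACHE)
	    && (isLRU(flags) || isSwapCache(flags))
	    && !isPrivate(flags) && !isAnon(mapping))
		/* only remember which counter to bump; clearing the
		 * bit(s) happens in the common code below */
		pfn_counter = &pfn_cache;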
What do you think?
Petr Tesarik