makedumpfile memory usage grows with system memory size

Thu Apr 5 02:52:11 EDT 2012

From: Atsushi Kumagai <kumagai-atsushi at mxc.nes.nec.co.jp>
Subject: Re: makedumpfile memory usage grows with system memory size
Date: Mon, 2 Apr 2012 16:46:51 +0900

> On Fri, 30 Mar 2012 09:51:43 +0900
> HATAYAMA Daisuke <d.hatayama at jp.fujitsu.com> wrote:

>> For performance impact, I don't know that exactly. But I guess
>> iterating filtering processing is most significant. I don't know exact
>> data structure for each kind of memory, but if there's the ones
>> needing linear order to look up the data for a given page frame
>> number, there would be necessary to add some special handling not to
>> reduce performance.

> 
> Thank you for your idea.
> 
> I think this is an important issue and I have no idea except iterating
> filtering processes for each memory range.
> 
> But as you said, we should consider the issue related to performance.
> For example, makedumpfile must parse free_list repeatedly to distinguish
> whether each pfn is a free page or not, because each range may be inside
> the same zone. It will be overhead.
> 

Hello Kumagai-san,

I looked into the contents of free_list and confirmed that even buddies
of the same order are not sorted by pfn. Below is the output of a
makedumpfile build I customized to print the buddy data.

# ./makedumpfile --message-level 32 -c -d 31 /media/127.0.0.1-2012-04-04-20:31:58/vmcore vmcore-cd31
NR_ZONE: 0
order: 10 migrate_type: 2 pfn: 3072
order: 10 migrate_type: 2 pfn: 2048
order: 10 migrate_type: 2 pfn: 1024
order: 9 migrate_type: 3 pfn: 512
order: 8 migrate_type: 0 pfn: 256
order: 6 migrate_type: 0 pfn: 64
order: 5 migrate_type: 0 pfn: 32
order: 4 migrate_type: 0 pfn: 128
order: 4 migrate_type: 0 pfn: 16
order: 2 migrate_type: 0 pfn: 144
order: 1 migrate_type: 0 pfn: 148
NR_ZONE: 1
order: 10 migrate_type: 2 pfn: 226304
order: 10 migrate_type: 2 pfn: 225280
order: 10 migrate_type: 2 pfn: 486400
order: 10 migrate_type: 2 pfn: 485376
order: 10 migrate_type: 2 pfn: 484352
order: 10 migrate_type: 2 pfn: 483328
order: 10 migrate_type: 2 pfn: 482304
order: 10 migrate_type: 2 pfn: 481280
<snip>

So we cannot simply walk free_list in increasing pfn order for a given
range of memory, suspend the walk, and save the position for the next
range.

So it is necessary to create a table that can be looked up in constant
time. But that table has to be held in memory: in the 2nd kernel we
cannot assume any backing store in general. Consider the case where
the dump is sent over scp, for example.

I think the basic approach would be some of the usual techniques for
programming with little memory, such as:

  * Create only the part of the bitmap that corresponds to the range
    of memory currently being processed, and repeat the table creation
    each time a new range of memory is started.
    => It is difficult to avoid walking the whole free_list every
    time, but this is the only idea I have come up with that always
    keeps memory consumption constant.

  * Keep the table in memory-mapping form rather than as a bitmap, and
    switch back to the bitmap if its size becomes larger than the
    bitmap's.
    => Bad performance in very fragmented cases, and constructing the
    memory mapping takes O(n^2), so it would be costly if done
    multiple times.

  * Compress the parts of the bitmap other than the one currently
    being processed.
    => Bad performance when compression doesn't work well.
       Bad performance when compression is done too many times.
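
As a rough illustration of the first idea above, here is a minimal
Python sketch. It is not makedumpfile code: the function names, the
(order, pfn) tuple layout and the pages_per_cycle parameter are all
assumptions made for the example. It only shows that the bitmap stays
a fixed size per cycle while the whole free_list is walked again for
every range.

```python
# Hypothetical sketch, not makedumpfile's implementation.
# free_list is an unordered list of (order, pfn) buddy entries.

def build_partial_bitmap(free_list, start_pfn, end_pfn):
    """Build a bitmap for [start_pfn, end_pfn); a set bit means free."""
    nbits = end_pfn - start_pfn
    bitmap = bytearray((nbits + 7) // 8)
    # The list is not sorted by pfn, so every entry must be examined.
    for order, pfn in free_list:
        lo = max(pfn, start_pfn)
        hi = min(pfn + (1 << order), end_pfn)
        for p in range(lo, hi):  # empty when the buddy is outside the range
            bit = p - start_pfn
            bitmap[bit >> 3] |= 1 << (bit & 7)
    return bitmap


def is_free(bitmap, start_pfn, pfn):
    """Constant-time lookup within the range covered by the bitmap."""
    bit = pfn - start_pfn
    return bool(bitmap[bit >> 3] & (1 << (bit & 7)))


def count_free_pages(free_list, max_pfn, pages_per_cycle):
    """Process memory range by range, rebuilding the bitmap each time."""
    total = 0
    for start in range(0, max_pfn, pages_per_cycle):
        end = min(start + pages_per_cycle, max_pfn)
        bitmap = build_partial_bitmap(free_list, start, end)
        total += sum(is_free(bitmap, start, p) for p in range(start, end))
    return total


# Zone 0 entries taken from the output above, as (order, pfn).
ZONE0 = [(10, 3072), (10, 2048), (10, 1024), (9, 512), (8, 256),
         (6, 64), (5, 32), (4, 128), (4, 16), (2, 144), (1, 148)]

if __name__ == "__main__":
    print(count_free_pages(ZONE0, 4096, 512))
```

With pages_per_cycle = 512 the bitmap is only 64 bytes, but ZONE0 is
walked 8 times, which is the overhead described in the "=>" line of
the first item.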

But before that, I also want to consider the possibility of increasing
the memory reserved for the 2nd kernel.

In the discussion of the 512MB reservation regression last month, Vivek
explained that 512MB is the current maximum value and is enough for
systems of up to 6TB.

  https://lkml.org/lkml/2012/3/13/372

But on such machines, where makedumpfile performance is affected, there
seems to be room to reserve more than 512MB of memory. Also, Yinghai
said, following Vivek, that system memory size will keep growing in the
coming years.

Note:
  * 1 bit in the bitmap represents 1 page frame. On x86, with 4kB
    pages, 1 byte covers 32kB of memory, so 1TB of memory requires
    32MB. The dump includes two bitmaps, so 64MB is needed in total.
  * The performance problem applies to free pages only. Cache, cache
    private, user and zero pages can be processed per range of memory
    with good performance.
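
The bitmap sizes in the first note can be checked with a few lines of
Python. This is only the arithmetic, assuming 4kB pages and taking
1TB as 2^40 bytes; the function name is made up for the example.

```python
# Hypothetical helper: size of one bitmap with 1 bit per page frame.
PAGE_SIZE = 4096          # assumed x86 page size in bytes
MIB = 1 << 20
TIB = 1 << 40


def bitmap_bytes(mem_bytes, page_size=PAGE_SIZE):
    """Bytes needed for a bitmap holding 1 bit per page frame."""
    pages = mem_bytes // page_size
    return (pages + 7) // 8


if __name__ == "__main__":
    one = bitmap_bytes(TIB)
    print(one // MIB)        # 32 (MB for one bitmap per 1TB)
    print(2 * one // MIB)    # 64 (MB for the two bitmaps)
```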

Thanks.
HATAYAMA, Daisuke