[PATCH] makedumpfile: keep dumpfile pages in a cache
ptesarik at suse.cz
Thu Sep 6 11:50:52 EDT 2012
On Monday, 3 September 2012 at 09:04:03, Petr Tesarik wrote:
> On Monday, 3 September 2012 at 05:42:33, Atsushi Kumagai wrote:
> > Hello Petr,
> > On Tue, 28 Aug 2012 19:49:49 +0200
> > Petr Tesarik <ptesarik at suse.cz> wrote:
> > > Add a simple cache for pages read from the dumpfile.
> > >
> > > This is a big win if we read consecutive data from one page, e.g.
> > > page descriptors, or even page table entries.
> > >
> > > Note that makedumpfile now always reads a complete page. This was
> > > already the case with kdump-compressed and sadump formats, but
> > > makedumpfile was throwing most of the data away. For the
> > > kdump-compressed case, we may actually save a lot of decompression,
> > > too.
> > >
> > > I tried to keep the cache small to minimize memory footprint, but it
> > > should be big enough to hold all pages to do 4-level paging plus some
> > > data. This is needed e.g. for vmalloc areas or Xen page frame table
> > > data, which are not contiguous in physical memory.
> > >
> > > Signed-off-by: Petr Tesarik <ptesarik at suse.cz>
> > It's interesting to me. I'd like to know how much performance improves
> > with this patch, so do you have any speed measurements?
> Not really. I only measured the hit/miss ratio. With Xen domU filtering
> and dump level 0, I got the following on a small system (2G RAM):
> cache hit: 1818880  cache miss: 1873
> The improvement isn't much for the non-Xen case, because the hits are mostly
> due to virtual-to-physical translations, and most Linux data is stored at
> virtual addresses that can be resolved by adding/subtracting a fixed offset.
> Of course, you then win only the syscall overhead, because Linux keeps the
> data in the kernel page cache anyway. I'll measure the times for you on a
> reasonably large system (~256G) and send the results here.
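For readers who haven't seen the patch, the mechanism under discussion is
roughly the following. This is an illustrative sketch only, with made-up
identifiers (cache_entry, cache_search, cache_alloc), not the patch's actual
code; it assumes a fixed 4K page size and implements LRU replacement with a
simple use counter.

#define PAGE_SIZE	4096ULL		/* assumed page size */
#define CACHE_SLOTS	8		/* 4-level page walk + some data */

struct cache_entry {
	unsigned long long paddr;	/* page-aligned physical address */
	unsigned long last_used;	/* use counter value at last access */
	int valid;			/* slot holds valid data */
	unsigned char data[PAGE_SIZE];	/* one page worth of file contents */
};

static struct cache_entry cache[CACHE_SLOTS];
static unsigned long use_counter;

/* Return the cached page containing paddr, or NULL on a miss. */
static void *cache_search(unsigned long long paddr)
{
	int i;

	paddr &= ~(PAGE_SIZE - 1);
	for (i = 0; i < CACHE_SLOTS; i++)
		if (cache[i].valid && cache[i].paddr == paddr) {
			cache[i].last_used = ++use_counter;
			return cache[i].data;
		}
	return NULL;
}

/* On a miss, evict the least recently used slot and hand it to the caller,
 * who fills it by reading a whole page from the dump file. */
static void *cache_alloc(unsigned long long paddr)
{
	int i, victim = 0;

	for (i = 1; i < CACHE_SLOTS; i++)
		if (cache[i].last_used < cache[victim].last_used)
			victim = i;
	cache[victim].paddr = paddr & ~(PAGE_SIZE - 1);
	cache[victim].valid = 1;
	cache[victim].last_used = ++use_counter;
	return cache[victim].data;
}

Eight slots are enough because a 4-level page table walk touches at most four
distinct pages, which leaves room for the data pages being resolved.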
I couldn't get a medium-sized system for testing, so I performed some
measurements on a 64G system. I ran makedumpfile repeatedly from the kdump
environment. The first run was used to cache the target filesystem's metadata,
and caches were not dropped between runs, to minimize the influence of the
target filesystem. I ran against /proc/vmcore, i.e. the input file was always
resident, so there was nothing to skew the results on the input side.
I tested both an uncompressed kdump file (to keep gzip/LZO out of the picture)
and an ELF file. For the Xen case I only used the ELF file, because the
kdump-compressed format is not available there.
First I ran it on bare metal. There was a clear improvement for -d31:
kdump no cache:
6.32user 55.20system 1:15.60elapsed 81%CPU (0avgtext+0avgdata
2080inputs+5714296outputs (2major+342minor)pagefaults 0swaps
kdump with cache:
6.02user 24.58system 0:46.51elapsed 65%CPU (0avgtext+0avgdata
1864inputs+5714288outputs (2major+350minor)pagefaults 0swaps
ELF no cache:
7.58user 74.25system 1:59.52elapsed 68%CPU (0avgtext+0avgdata
728inputs+9288824outputs (1major+342minor)pagefaults 0swaps
ELF with cache:
7.43user 44.21system 1:17.41elapsed 66%CPU (0avgtext+0avgdata
728inputs+9288792outputs (1major+349minor)pagefaults 0swaps
To sum it up, I can see an improvement of approx. 50% in system time. The
increase in memory consumption is a bit more than I would expect (why do I see
~100k for a cache of 12k?), but acceptable nevertheless. I can see a slight
increase in user time (approx. 25%) for the kdump case, which could be
attributed to the cache overhead. I don't have any explanation for the
decreased user time for the ELF case, but it's consistent.
I also tried running makedumpfile with -d1. This results in long sequential
reads, so it's the worst case for a simple LRU-policy cache. The results are
too unstable for a reliable measurement, but there seems to be a slight
performance hit. It is certainly less than 5% of the total time.
I think there are two reasons for that:
1. We're copying file data twice for each page: once from the kernel page
cache to the process address space, and once from the internal cache to the
caller's buffer (see the sketch after this list).
2. Instead of reusing the same data location, we're rotating 8 different pages
(or even up to twice as many if the allocated space is neither contiguous nor
page-aligned). This stresses both the CPU's L1 d-cache and the TLB a tiny bit
more. Note that in the /proc/vmcore case, the kernel sequentially maps all
physical memory of the crashed system, so every cache page may be evicted
before we get to use it again. This could explain why I observe an increase
in system time despite making fewer system calls.
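To make point 1 concrete, a read path built on the sketch above would look
roughly like this (dumpfile_fd is an assumed, already-open descriptor; for
simplicity the file offset is taken to equal the physical address, which is
close to true for /proc/vmcore but not for other formats, and the requested
range is assumed not to cross a page boundary):

#include <string.h>
#include <unistd.h>

static int dumpfile_fd;			/* assumed open dump file */

static int readmem_cached(unsigned long long paddr, void *buf, size_t size)
{
	unsigned long long page = paddr & ~(PAGE_SIZE - 1);
	void *data = cache_search(page);

	if (!data) {
		data = cache_alloc(page);
		/* copy #1: kernel page cache -> our cache slot */
		if (pread(dumpfile_fd, data, PAGE_SIZE, page)
		    != (ssize_t)PAGE_SIZE)
			return 0;
	}
	/* copy #2: our cache slot -> the caller's buffer */
	memcpy(buf, (unsigned char *)data + (paddr & (PAGE_SIZE - 1)), size);
	return 1;
}

Handing callers a pointer into the cache slot instead of doing copy #2 would
remove one of the two copies for callers that only inspect the data.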
There are a number of things I could do to regain the old performance if
anybody is concerned about the slight regression in this worst case; one
possibility is sketched below. Just let me know.
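One such option (my speculation, not something in the posted patch) would be
to detect sequential scans and serve them from a dedicated one-page buffer,
so that a long -d1 read cannot evict the page table pages that the
virtual-to-physical translation depends on:

static unsigned char stream_buf[PAGE_SIZE];
static unsigned long long prev_page = ~0ULL;	/* previously requested page */

static void *read_page(unsigned long long page)
{
	int sequential = (page == prev_page + PAGE_SIZE);
	void *data;

	prev_page = page;
	if (sequential) {
		/* streaming access: bypass the LRU cache entirely */
		if (pread(dumpfile_fd, stream_buf, PAGE_SIZE, page)
		    != (ssize_t)PAGE_SIZE)
			return NULL;
		return stream_buf;
	}
	if ((data = cache_search(page)) != NULL)
		return data;
	data = cache_alloc(page);
	if (pread(dumpfile_fd, data, PAGE_SIZE, page) != (ssize_t)PAGE_SIZE)
		return NULL;
	return data;
}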
Second, I ran with the Xen hypervisor. Since dump levels greater than 1 don't
work, I ran with '-E -X -d1'. Even though this includes the inefficient page
walk described above, the improvement was immense.
ELF no cache:
95.33user 657.18system 13:08.40elapsed 95%CPU (0avgtext+0avgdata
704inputs+6563856outputs (1major+388minor)pagefaults 0swaps
ELF with cache:
61.14user 110.15system 3:24.24elapsed 83%CPU (0avgtext+0avgdata
2360inputs+6563872outputs (2major+396minor)pagefaults 0swaps
In short, almost 80% shorter total time.