[PATCH] makedumpfile: keep dumpfile pages in a cache

Thu Sep 6 11:50:52 EDT 2012

On Mon, 3 Sep 2012 09:04:03, Petr Tesarik wrote:
> On Mon, 3 Sep 2012 05:42:33, Atsushi Kumagai wrote:
> > Hello Petr,
> > 
> > On Tue, 28 Aug 2012 19:49:49 +0200
> > 
> > Petr Tesarik <ptesarik at suse.cz> wrote:
> > > Add a simple cache for pages read from the dumpfile.
> > > 
> > > This is a big win if we read consecutive data from one page, e.g.
> > > page descriptors, or even page table entries.
> > > 
> > > Note that makedumpfile now always reads a complete page. This was
> > > already the case with kdump-compressed and sadump formats, but
> > > makedumpfile was throwing most of the data away. For the
> > > kdump-compressed case, we may actually save a lot of decompression,
> > > too.
> > > 
> > > I tried to keep the cache small to minimize memory footprint, but it
> > > should be big enough to hold all pages to do 4-level paging plus some
> > > data. This is needed e.g. for vmalloc areas or Xen page frame table
> > > data, which are not contiguous in physical memory.
> > > 
> > > Signed-off-by: Petr Tesarik <ptesarik at suse.cz>
> > 
> > It's interesting to me. I want to know how performance will be improved
> > with this patch, so do you have speed measurements ?
> 
> Not really. I only measured the hit/miss ratio, and with filtering Xen domU
> and dump level 0, I got the following on a small system (2G RAM):
> 
> cache hit: 1818880  cache miss: 1873
> 
> The improvement isn't much for non-Xen case, because the hits are mostly
> due to virtual-to-physical translations, and most of Linux data is stored
> at virtual addresses that can be resolved by adding/subtracting a fixed
> offset.
> 
> Of course, you will also win only the syscall overhead, because Linux keeps
> the data in the kernel pagecache anyway. I'll measure the times for you on
> a reasonably large system (~256G) and send the results here.

I couldn't get a medium-sized system for testing, so I performed some 
measurements on a 64G system. I ran makedumpfile repeatedly from the kdump 
environment. The first run was used to cache target filesystem metadata, and the 
cache was not dropped between runs to minimize effects of the target 
filesystem. I ran it against /proc/vmcore, i.e. the input file was always 
resident, nothing to skew the results.

I tried with a kdump file with no compression (to get gzip/LZO out of the 
picture) and an ELF file. For the Xen case I only did the ELF file, because 
kdump is not available.

First I ran it on bare metal. There was a slight improvement for -d31:

kdump no cache:
6.32user 55.20system 1:15.60elapsed 81%CPU (0avgtext+0avgdata 4800maxresident)k
2080inputs+5714296outputs (2major+342minor)pagefaults 0swaps

kdump with cache:
6.02user 24.58system 0:46.51elapsed 65%CPU (0avgtext+0avgdata 4912maxresident)k
1864inputs+5714288outputs (2major+350minor)pagefaults 0swaps

ELF no cache:
7.58user 74.25system 1:59.52elapsed 68%CPU (0avgtext+0avgdata 4800maxresident)k
728inputs+9288824outputs (1major+342minor)pagefaults 0swaps

ELF with cache:
7.43user 44.21system 1:17.41elapsed 66%CPU (0avgtext+0avgdata 4896maxresident)k
728inputs+9288792outputs (1major+349minor)pagefaults 0swaps

To sum it up, I can see an improvement of approx. 50% in system time. The 
increase in memory consumption is a bit more than I would expect (why do I see 
~100k for a cache of 32k?), but acceptable nevertheless. I can see a slight 
increase in user time (approx. 5%) for the kdump case, which could be 
attributed to the cache overhead. I don't have any explanation for the 
decreased user time for the ELF case, but it's consistent.

I also tried running makedumpfile with -d1. This results in long sequential 
reads, so it's the worst case for a simple LRU-policy cache. The results are 
too unstable to make a reliable measurement, but there seems to be a slight 
performance hit. It is certainly less than 5% total time.

I think there are two reasons for that:

1. We're copying file data twice for each page (once from the kernel page 
cache to the process space, and once from the internal cache to the 
destination).
2. Instead of reusing the same data location, we're rotating 8 different pages 
(or even up to twice as many if the allocated space is neither contiguous nor 
page-aligned). This stresses both the CPU's L1 d-cache and the TLB a tiny 
bit more. Note that in the /proc/vmcore case, the kernel sequentially maps all 
physical memory of the crashed system, so every cache page may be evicted 
before we get to using it again. This could explain why I observe an increase 
in system time despite making fewer system calls.
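
To illustrate why a long sequential read is the worst case, here is a minimal 
sketch of an 8-slot LRU page cache. It is not the code from the patch: all 
names (cache_get, read_page_from_dump, simulate) are hypothetical, the 4 KiB 
page size is an assumption, and the "dump" is a stand-in that fills each page 
with its page number.

```c
#include <string.h>

#define PAGE_SIZE   4096  /* assumption: 4 KiB pages */
#define CACHE_SLOTS 8     /* the mail mentions 8 rotating pages */

struct slot {
	long long paddr;      /* page-aligned offset, -1 while empty */
	unsigned long stamp;  /* last-use tick; the smallest is evicted */
	unsigned char data[PAGE_SIZE];
};

static struct slot cache[CACHE_SLOTS];
static unsigned long tick, hits, misses;

void cache_init(void)
{
	for (int i = 0; i < CACHE_SLOTS; i++) {
		cache[i].paddr = -1;
		cache[i].stamp = 0;
	}
	tick = hits = misses = 0;
}

/* Hypothetical stand-in for reading one page from the dump:
 * fills the buffer with the low byte of the page number. */
static void read_page_from_dump(long long paddr, unsigned char *buf)
{
	memset(buf, (int)((paddr / PAGE_SIZE) & 0xff), PAGE_SIZE);
}

/* Return a pointer to the cached byte at addr; on a miss the whole
 * page is loaded into the least recently used slot. */
const unsigned char *cache_get(long long addr)
{
	long long paddr = addr - addr % PAGE_SIZE;
	struct slot *victim = &cache[0];

	for (int i = 0; i < CACHE_SLOTS; i++) {
		if (cache[i].paddr == paddr) {
			hits++;
			cache[i].stamp = ++tick;
			return cache[i].data + addr % PAGE_SIZE;
		}
		if (cache[i].stamp < victim->stamp)
			victim = &cache[i];
	}
	misses++;
	read_page_from_dump(paddr, victim->data);
	victim->paddr = paddr;
	victim->stamp = ++tick;
	return victim->data + addr % PAGE_SIZE;
}

/* Touch npages pages in order, reads_per_page times each;
 * returns the number of cache hits. */
unsigned long simulate(int npages, int reads_per_page)
{
	cache_init();
	for (long long p = 0; p < npages; p++)
		for (int r = 0; r < reads_per_page; r++)
			cache_get(p * PAGE_SIZE + (r * 8) % PAGE_SIZE);
	return hits;
}
```

In this sketch, simulate(1000, 1) produces no hits at all: every page is 
loaded, copied once and evicted without being reused, so the cache only adds 
work. By contrast, simulate(8, 64) misses once per page and hits on the other 
63 reads, which is the pattern of reading consecutive data from one page.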

There are a lot of things I could do to regain the old performance, if anybody 
is concerned about the slight performance regression for this worst case. Just 
let me know.

Second, I ran with the Xen hypervisor. Since dump levels greater than 1 don't 
work, I ran with '-E -X -d1'. Even though this includes the inefficient page 
walk described above, the improvement was immense.

no cache:
95.33user 657.18system 13:08.40elapsed 95%CPU (0avgtext+0avgdata 5440maxresident)k
704inputs+6563856outputs (1major+388minor)pagefaults 0swaps

with cache:
61.14user 110.15system 3:24.24elapsed 83%CPU (0avgtext+0avgdata 5584maxresident)k
2360inputs+6563872outputs (2major+396minor)pagefaults 0swaps

In short, almost 75% shorter total time.
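
For anyone who wants to recompute the relative savings from the time(1) 
figures above, a small sketch follows. The helper names are hypothetical; it 
only assumes the "m:ss.ss" elapsed format shown in the output.

```c
#include <stdio.h>

/* Convert an elapsed time in "m:ss.ss" form to seconds.
 * Returns -1.0 if the string does not match that form. */
double elapsed_to_seconds(const char *s)
{
	int min;
	double sec;

	if (sscanf(s, "%d:%lf", &min, &sec) != 2)
		return -1.0;
	return min * 60.0 + sec;
}

/* Relative reduction from before to after, in percent. */
double percent_reduction(double before, double after)
{
	return 100.0 * (before - after) / before;
}
```

For the Xen runs, percent_reduction(elapsed_to_seconds("13:08.40"), 
elapsed_to_seconds("3:24.24")) gives the change in total time, and 
percent_reduction(657.18, 110.15) the change in system time.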

Petr Tesarik
SUSE Linux