[PATCH] makedumpfile: keep dumpfile pages in a cache
Atsushi Kumagai
kumagai-atsushi at mxc.nes.nec.co.jp
Tue Nov 13 22:47:24 EST 2012
Hello Petr,
On Thu, 6 Sep 2012 17:50:52 +0200
Petr Tesarik <ptesarik at suse.cz> wrote:
> Dne Po 3. září 2012 09:04:03 Petr Tesarik napsal(a):
> > Dne Po 3. září 2012 05:42:33 Atsushi Kumagai napsal(a):
> > > Hello Petr,
> > >
> > > On Tue, 28 Aug 2012 19:49:49 +0200
> > >
> > > Petr Tesarik <ptesarik at suse.cz> wrote:
> > > > Add a simple cache for pages read from the dumpfile.
> > > >
> > > > This is a big win if we read consecutive data from one page, e.g.
> > > > page descriptors, or even page table entries.
> > > >
> > > > Note that makedumpfile now always reads a complete page. This was
> > > > already the case with kdump-compressed and sadump formats, but
> > > > makedumpfile was throwing most of the data away. For the
> > > > kdump-compressed case, we may actually save a lot of decompression,
> > > > too.
> > > >
> > > > I tried to keep the cache small to minimize the memory footprint, but it
> > > > should be big enough to hold all the pages needed for a 4-level page
> > > > table walk plus some data. This is needed e.g. for vmalloc areas or Xen
> > > > page frame table data, which are not contiguous in physical memory.
> > > >
> > > > Signed-off-by: Petr Tesarik <ptesarik at suse.cz>
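For illustration only, here is a minimal sketch of the kind of cache the patch
describes: a small, fixed pool of page-sized buffers with least-recently-used
eviction. All names, the 4 KiB page size and the 8-entry pool are assumptions
made for the sketch, not code taken from the patch itself.

#include <stddef.h>

#define CACHE_PAGE_SIZE 4096    /* assumed dump page size */
#define CACHE_ENTRIES   8       /* assumed pool size */

struct cache_entry {
        int valid;
        unsigned long long paddr;       /* page-aligned physical address */
        unsigned long last_used;        /* LRU stamp */
        unsigned char data[CACHE_PAGE_SIZE];
};

static struct cache_entry cache_pool[CACHE_ENTRIES];
static unsigned long cache_clock;

/*
 * Return a pointer to the cached copy of the byte at physical address paddr.
 * On a miss, the least recently used entry is evicted and the whole page is
 * reread through read_page(), a stand-in for the format-specific reader
 * (ELF, kdump-compressed, sadump, ...).  Consecutive reads from the same
 * page -- page descriptors, page table entries -- then cost no further file
 * access, and for kdump-compressed input the page is decompressed only once.
 */
static void *cache_lookup(unsigned long long paddr,
                          int (*read_page)(unsigned long long page, void *buf))
{
        unsigned long long page = paddr & ~(unsigned long long)(CACHE_PAGE_SIZE - 1);
        struct cache_entry *victim = &cache_pool[0];
        int i;

        for (i = 0; i < CACHE_ENTRIES; i++) {
                struct cache_entry *e = &cache_pool[i];

                if (e->valid && e->paddr == page) {     /* hit */
                        e->last_used = ++cache_clock;
                        return e->data + (paddr - page);
                }
                if (!e->valid || e->last_used < victim->last_used)
                        victim = e;                     /* remember free/LRU slot */
        }

        if (!read_page(page, victim->data))             /* miss: refill LRU slot */
                return NULL;
        victim->valid = 1;
        victim->paddr = page;
        victim->last_used = ++cache_clock;
        return victim->data + (paddr - page);
}

A readmem()-style wrapper would then copy the requested bytes out of the cached
page instead of issuing one file read per request, which is where the saved
syscalls and decompressions in the measurements below would come from.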
Sorry for the late reply.
According to your measurements, the patch looks good from a performance standpoint.
However, I found the issue below in v1.5.1-beta, and git bisect confirmed that
this patch introduces it (though I haven't found the root cause yet).
Result on kernel 3.4:
$ makedumpfile --non-cyclic vmcore dumpfile
Copying data : [ 62 %]
readpage_elf: Can't convert a physical address(a0000) to offset.
readmem: type_addr: 1, addr:1000a0000, size:4096
read_pfn: Can't get the page data.
makedumpfile Failed.
$
It seems to be a critical issue for all users, so I will postpone merging this
patch until it is solved.
Thanks
Atsushi Kumagai
> > >
> > > It's interesting to me. I'd like to know how much performance improves
> > > with this patch, so do you have any speed measurements?
> >
> > Not really. I only measured the hit/miss ratio, and with Xen domU filtering
> > at dump level 0, I got the following on a small system (2G RAM):
> >
> > cache hit: 1818880 cache miss: 1873
> >
> > The improvement isn't much for the non-Xen case, because the hits are mostly
> > due to virtual-to-physical translations, and most Linux data is stored at
> > virtual addresses that can be resolved by adding or subtracting a fixed
> > offset.
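To make that distinction concrete: with the classic (pre-KASLR) x86_64 layout,
an address in the kernel's direct mapping is resolved by a single subtraction
and needs no dump-file reads at all, whereas a vmalloc (or Xen) address needs a
4-level page table walk, i.e. one page-sized read per level, and those are
exactly the reads the cache keeps absorbing. The sketch below is a rough
illustration under those assumptions: the base constant and masks are
simplified, huge pages and flag checks are ignored, and it reuses the
hypothetical cache_lookup() helper from the earlier sketch.

#define DIRECTMAP_BASE 0xffff880000000000ULL    /* assumed pre-KASLR x86_64 direct-map start */

/* Direct-mapped kernel addresses: one subtraction, zero dump-file reads. */
static unsigned long long directmap_vtop(unsigned long long vaddr)
{
        return vaddr - DIRECTMAP_BASE;
}

/*
 * Other addresses go through a 4-level walk: one page-sized read each for the
 * PGD, PUD, PMD and PTE level.  Neighbouring virtual addresses index the same
 * tables, so these reads are the ones that keep hitting the page cache.
 */
static unsigned long long pagetable_vtop(unsigned long long pgd_paddr,
                                         unsigned long long vaddr,
                                         int (*read_page)(unsigned long long, void *))
{
        unsigned long long entry = pgd_paddr;
        int shift;

        for (shift = 39; shift >= 12; shift -= 9) {     /* PGD, PUD, PMD, PTE */
                unsigned long long *table = cache_lookup(entry, read_page);

                if (!table)
                        return ~0ULL;
                entry = table[(vaddr >> shift) & 0x1ff] & 0x000ffffffffff000ULL;
        }
        return entry | (vaddr & 0xfff);
}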
> >
> > Of course, you will also only save the syscall overhead, because Linux keeps
> > the data in the kernel page cache anyway. I'll measure the times for you on
> > a reasonably large system (~256G) and send the results here.
>
> I couldn't get a medium-sized system for testing, so I performed some
> measurements on a 64G system. I ran makedumpfile repeatedly from the kdump
> environment. The first run was used to warm the cache with target filesystem
> metadata, and the cache was not dropped between runs, to minimize effects of
> the target filesystem. I ran it against /proc/vmcore, i.e. the input file was
> always resident, so there was nothing to skew the results.
>
> I tried a kdump-format file with no compression (to take gzip/LZO out of the
> picture) and an ELF file. For the Xen case I only did the ELF file, because
> the kdump format is not available.
>
> First I ran it on bare metal. There was a slight improvement for -d31:
>
> kdump no cache:
> 6.32user 55.20system 1:15.60elapsed 81%CPU (0avgtext+0avgdata
> 4800maxresident)k
> 2080inputs+5714296outputs (2major+342minor)pagefaults 0swaps
>
> kdump with cache:
> 6.02user 24.58system 0:46.51elapsed 65%CPU (0avgtext+0avgdata
> 4912maxresident)k
> 1864inputs+5714288outputs (2major+350minor)pagefaults 0swaps
>
> ELF no cache:
> 7.58user 74.25system 1:59.52elapsed 68%CPU (0avgtext+0avgdata
> 4800maxresident)k
> 728inputs+9288824outputs (1major+342minor)pagefaults 0swaps
>
> ELF with cache:
> 7.43user 44.21system 1:17.41elapsed 66%CPU (0avgtext+0avgdata
> 4896maxresident)k
> 728inputs+9288792outputs (1major+349minor)pagefaults 0swaps
>
> To sum it up, I can see an improvement of approx. 50% in system time. The
> increase in memory consumption is a bit more than I would expect (why do I see
> ~100k for a cache of 12k?), but acceptable nevertheless. I can see a slight
> increase in user time (approx. 25%) for the kdump case, which could be
> attributed to the cache overhead. I don't have any explanation for the
> decreased user time for the ELF case, but it's consistent.
>
> I also tried running makedumpfile with -d1. This results in long sequential
> reads, so it's the worst case for a simple LRU-policy cache. The results are
> too unstable to make a reliable measurement, but there seems to be a slight
> performance hit. It is certainly less than 5% of the total time.
>
> I think there are two reasons for that:
>
> 1. We're copying file data twice for each page (once from the kernel page
> cache to the process space, and once from the internal cache to the
> destination).
> 2. Instead of reusing the same data location, we're rotating through 8
> different pages (or even up to twice as many if the allocated space is neither
> contiguous nor page-aligned). This stresses both the CPU's L1 d-cache and the
> TLB a tiny bit more. Note that in the /proc/vmcore case, the kernel
> sequentially maps all physical memory of the crashed system, so every cache
> page may be evicted before we get to using it again. This could explain why I
> observe an increase in system time despite making fewer system calls.
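As a toy illustration of this worst case (not makedumpfile code): a single,
strictly sequential pass over far more pages than an 8-entry LRU cache can hold
never produces a hit, so the cache only adds the extra copying and bookkeeping
described above.

#include <stdio.h>

#define ENTRIES 8

int main(void)
{
        long cached[ENTRIES] = { -1, -1, -1, -1, -1, -1, -1, -1 };
        long stamp[ENTRIES] = { 0 };
        long clock = 0, hits = 0, misses = 0, page;

        /* One pass over a million distinct pages, strictly in order. */
        for (page = 0; page < 1000000; page++) {
                int i, victim = 0, hit = 0;

                for (i = 0; i < ENTRIES; i++) {
                        if (cached[i] == page) {
                                hit = 1;
                                stamp[i] = ++clock;
                                break;
                        }
                        if (stamp[i] < stamp[victim])
                                victim = i;
                }
                if (hit) {
                        hits++;
                } else {        /* every access evicts the LRU page */
                        misses++;
                        cached[victim] = page;
                        stamp[victim] = ++clock;
                }
        }
        printf("hits=%ld misses=%ld\n", hits, misses);  /* hits=0 misses=1000000 */
        return 0;
}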
>
> There are a number of things I could do to regain the old performance, if
> anybody is concerned about the slight regression in this worst case. Just let
> me know.
>
> Second, I ran with the Xen hypervisor. Since dump levels greater than 1 don't
> work, I ran with '-E -X -d1'. Even though this includes the inefficient page
> walk described above, the improvement was immense.
>
> no cache:
> 95.33user 657.18system 13:08.40elapsed 95%CPU (0avgtext+0avgdata
> 5440maxresident)k
> 704inputs+6563856outputs (1major+388minor)pagefaults 0swaps
>
> with cache:
> 61.14user 110.15system 3:24.24elapsed 83%CPU (0avgtext+0avgdata
> 5584maxresident)k
> 2360inputs+6563872outputs (2major+396minor)pagefaults 0swaps
>
> In short, almost 80% shorter total time.
>
> Petr Tesarik
> SUSE Linux
>
> _______________________________________________
> kexec mailing list
> kexec at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec