[PATCH v2 0/8] Handle mmaped regions in cache [more analysis]

Fri Mar 13 01:04:58 PDT 2015

On Fri, 13 Mar 2015 04:10:22 +0000
Atsushi Kumagai <ats-kumagai at wm.jp.nec.com> wrote:

> Hello,
> 
> (Note: my email address has changed.)
> 
> On x86_64, calling ioremap/iounmap per page in copy_oldmem_page()
> causes a big performance degradation, so mmap() was introduced on
> /proc/vmcore. However, there is no big difference between read() and
> mmap() on s390, since it doesn't need ioremap/iounmap in
> copy_oldmem_page(), so other issues have been revealed, right?
> 
> [...]
> 
> >> I counted the mmap and read system calls with "perf stat":
> >>
> >>                      mmap   unmap   read =    sum
> >>   ===============================================
> >>   mmap -d0            482     443    165     1090
> >>   mmap -d31         13454   13414    165    27033
> >>   non-mmap -d0         34       3 458917   458954
> >>   non-mmap -d31        34       3  74273    74310
> >
> >If your VM has 1.5 GiB of RAM, then the numbers for -d0 look
> >reasonable. For -d31, we should be able to do better than this
> >by allocating more cache slots and improving the algorithm.
> >I originally didn't deem it worth the effort, but seeing almost
> >30 times more mmaps than with -d0 may change my mind.
> 
> Are you going to do it as v3 patch?

No. Tuning the caching algorithm requires a lot of research. I plan to
do it, but testing it with all scenarios (and tuning the algorithm
based on the results) will probably take weeks. I don't think it makes
sense to wait for it.

> I'm going to release v1.5.8 soon, so I'll adopt the v2 patch if
> you don't plan to update it.

Since v2 already brings some performance gain, I would appreciate it if
you could adopt it for v1.5.8.

Thank you very much,
Petr Tesarik

> Thanks
> Atsushi Kumagai
> 
> >
> >> Here are the actual results I got with "perf record":
> >>
> >> $ time ./makedumpfile  -d 31 /proc/vmcore  /dev/null -f
> >>
> >>   Output of "perf report" for mmap case:
> >>
> >>    /* Most time spent for unmap in kernel */
> >>    29.75%  makedumpfile  [kernel.kallsyms]  [k] unmap_single_vma
> >>     9.84%  makedumpfile  [kernel.kallsyms]  [k] remap_pfn_range
> >>     8.49%  makedumpfile  [kernel.kallsyms]  [k] vm_normal_page
> >>
> >>    /* Still some mmap overhead in makedumpfile readmem() */
> >>    21.56%  makedumpfile  makedumpfile       [.] readmem
> >
> >This number is interesting. Did you compile makedumpfile with
> >optimizations? If yes, then this number probably includes some
> >functions which were inlined.
> >
> >>     8.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
> >>
> >>   Output of "perf report" for non-mmap case:
> >>
> >>    /* Time spent for sys_read (that needs also two copy operations on s390 :() */
> >>    25.32%  makedumpfile  [kernel.kallsyms]  [k] memcpy_real
> >>    22.74%  makedumpfile  [kernel.kallsyms]  [k] __copy_to_user
> >>
> >>    /* readmem() for read path is cheaper ? */
> >>    13.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
> >>     4.53%  makedumpfile  makedumpfile       [.] readmem
> >
> >Yes, much lower overhead of readmem is strange. For a moment I
> >suspected wrong accounting of the page fault handler, but then I
> >realized that for /proc/vmcore, all page table entries are created
> >with the present bit set already, so there are no page faults...
> >
> >I haven't had time yet to set up a system for reproduction, but I'll
> >try to identify what's eating up the CPU time in readmem().
> >
> >>[...]
> >> I hope this analysis helps more than it confuses :-)
> >>
> >> As a conclusion, we could think of mapping larger chunks
> >> also for the fragmented case of -d 31 to reduce the amount
> >> of mmap/munmap calls.
> >
> >I agree in general. Memory mapped through /proc/vmcore does not
> >increase run-time memory requirements, because it only adds a mapping
> >to the old kernel's memory. The only limiting factor is the virtual
> >address space. On many architectures, this is no issue at all, and we
> >could simply map the whole file at the beginning. On some architectures,
> >the virtual address space is smaller than possible physical RAM, so
> >this approach would not work for them.
> >
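The idea of mapping larger chunks can be sketched roughly as below. This
is only an illustration under stated assumptions, not makedumpfile code:
struct map_window, window_init(), window_read() and window_close() are
hypothetical names, and the chunk size is whatever multiple of the page
size the caller picks. One mapping is kept and reused until a request
falls outside it, so scattered page reads inside the same chunk cost a
single mmap()/munmap() pair.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical single-window cache over a file such as /proc/vmcore. */
struct map_window {
	int fd;
	off_t file_size;
	size_t chunk;          /* window size, a multiple of the page size */
	void *base;            /* current mapping, or MAP_FAILED if none */
	off_t start;           /* file offset where the mapping begins */
	size_t len;            /* length of the current mapping */
	unsigned long nr_mmap; /* number of mmap() calls issued so far */
};

static void window_init(struct map_window *w, int fd, off_t file_size,
			size_t chunk)
{
	/* mmap() offsets must be page aligned, so the chunk must be too. */
	assert(chunk > 0 && chunk % (size_t)sysconf(_SC_PAGESIZE) == 0);
	w->fd = fd;
	w->file_size = file_size;
	w->chunk = chunk;
	w->base = MAP_FAILED;
	w->start = 0;
	w->len = 0;
	w->nr_mmap = 0;
}

/* Copy size bytes at file offset off into buf, remapping only when the
 * request leaves the current window. Returns 0 on success, -1 on error. */
static int window_read(struct map_window *w, off_t off, void *buf,
		       size_t size)
{
	if (off < 0 || off + (off_t)size > w->file_size)
		return -1;

	while (size > 0) {
		if (w->base == MAP_FAILED || off < w->start ||
		    off >= w->start + (off_t)w->len) {
			if (w->base != MAP_FAILED)
				munmap(w->base, w->len);
			/* Align the window start down to a chunk boundary. */
			w->start = off - (off % (off_t)w->chunk);
			w->len = w->chunk;
			if (w->start + (off_t)w->len > w->file_size)
				w->len = (size_t)(w->file_size - w->start);
			w->base = mmap(NULL, w->len, PROT_READ, MAP_PRIVATE,
				       w->fd, w->start);
			if (w->base == MAP_FAILED)
				return -1;
			w->nr_mmap++;
		}

		size_t avail = w->len - (size_t)(off - w->start);
		size_t n = size < avail ? size : avail;

		memcpy(buf, (char *)w->base + (off - w->start), n);
		buf = (char *)buf + n;
		off += (off_t)n;
		size -= n;
	}
	return 0;
}

static void window_close(struct map_window *w)
{
	if (w->base != MAP_FAILED)
		munmap(w->base, w->len);
	w->base = MAP_FAILED;
}
```

With a chunk of 16 pages, reading pages 0, 1 and 17 of a file issues two
mmap() calls instead of three. A larger chunk lowers the call count
further, at the cost of virtual address space only.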
> >> Another open question was why the mmap case consumes more CPU
> >> time in readmem() than the read case. Our theory is that the
> >> first memory access is slower because it is not in the HW
> >> cache. For the mmap case userspace issues the first access (copy
> >> to makedumpfile cache) and for the read case the kernel issues
> >> the first access (memcpy_real/copy_to_user). Therefore the
> >> cache miss is accounted to userspace for mmap and to kernel for
> >> read.
> >
> >I have no idea how to measure this on s390. On x86_64 I would add
> >some asm code to read TSC before and after the memory access
> >instruction. I guess there is a similar counter on s390. Suggestions?
> >
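For the x86_64 side of that measurement, a minimal sketch could look
like the following. It is an assumption-laden illustration, not code
from makedumpfile: read_ticks() and time_first_access() are hypothetical
helpers, the TSC is read through the compiler's __rdtsc() intrinsic
rather than hand-written asm, and on other architectures the sketch
merely falls back to clock_gettime(), which is much coarser and is not
meant as an answer for s390. Note that RDTSC is not a serializing
instruction, so a single-load measurement is only approximate.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Return a tick count: TSC cycles on x86, otherwise nanoseconds from
 * CLOCK_MONOTONIC (the fallback is an assumption for portability). */
static inline uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
	return __rdtsc();
#else
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Time one load from p and store the loaded byte in *out. The volatile
 * qualifier keeps the compiler from dropping or hoisting the load. */
static uint64_t time_first_access(const volatile unsigned char *p,
				  unsigned char *out)
{
	assert(p != NULL && out != NULL);

	uint64_t t0 = read_ticks();
	*out = *p;
	uint64_t t1 = read_ticks();

	return t1 - t0;
}
```

Calling time_first_access() once on a freshly mapped page and again on
the same address would give a rough cold-versus-warm comparison, which
is the difference the cache-miss theory predicts.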
> >> And last but not least, perhaps on s390 we could replace
> >> the bounce buffer used for memcpy_real()/copy_to_user() by
> >> some more intelligent solution.
> >
> >Which would then improve the non-mmap times even more, right?
> >
> >Petr T