[PATCH v2 0/8] Handle mmaped regions in cache [more analysis]
Michael Holzheu
holzheu at linux.vnet.ibm.com
Mon Mar 9 09:08:58 PDT 2015
Hello Petr,
With your patches applied, I used "perf record" and "perf stat"
to check where the CPU time is consumed for -d31 and -d0.
For -d31 the read case is faster, and for -d0 the mmap case
is faster.
$ time ./makedumpfile -d 31 /proc/vmcore /dev/null -f [--non-mmap]

            user    sys  = total
=================================
mmap       0.156  0.248    0.404
non-mmap   0.090  0.180    0.270
$ time ./makedumpfile -d 0 /proc/vmcore /dev/null -f [--non-mmap]

            user    sys  = total
=================================
mmap       0.637  0.018    0.655
non-mmap   0.275  1.153    1.428
As already said, we think the reason is that for -d0 we issue
only a small number of mmap/munmap calls, because the mmap
chunks are larger than the read chunks.
For -d31 the memory is fragmented, so we issue lots of small
mmap/munmap calls. Because munmap() is a very expensive
operation (at least on s390) and we need two system calls per
chunk (mmap plus munmap), the mmap mode is slower than the
read mode.
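For illustration, the hot path in the fragmented case boils down to a
loop like this (a simplified sketch with made-up names, not the actual
makedumpfile code): every chunk costs one mmap() plus one munmap().

#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

/* Sketch: copy 'nr' fragmented chunks via per-chunk mappings.  The
 * two system calls per chunk dominate when the chunks are small. */
static int copy_chunks_mmap(int fd, const off_t *offs,
			    const size_t *lens, int nr, char *out)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	int i;

	for (i = 0; i < nr; i++) {
		off_t pg_off = offs[i] & ~((off_t)pagesize - 1);
		size_t map_len = lens[i] + (offs[i] - pg_off);
		char *map = mmap(NULL, map_len, PROT_READ, MAP_SHARED,
				 fd, pg_off);

		if (map == MAP_FAILED)
			return -1;
		memcpy(out, map + (offs[i] - pg_off), lens[i]);
		munmap(map, map_len);	/* very expensive on s390 */
		out += lens[i];
	}
	return 0;
}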
I counted the mmap, munmap, and read system calls with "perf stat":

                 mmap  munmap    read    = sum
===============================================
mmap -d0          482     443     165     1090
mmap -d31       13454   13414     165    27033
non-mmap -d0       34       3  458917   458954
non-mmap -d31      34       3   74273    74310
Here are the actual results I got with "perf record":
$ time ./makedumpfile -d 31 /proc/vmcore /dev/null -f
Output of "perf report" for the mmap case:

/* Most of the time is spent for unmap in the kernel */
 29.75%  makedumpfile  [kernel.kallsyms]  [k] unmap_single_vma
  9.84%  makedumpfile  [kernel.kallsyms]  [k] remap_pfn_range
  8.49%  makedumpfile  [kernel.kallsyms]  [k] vm_normal_page
/* Still some mmap overhead in makedumpfile readmem() */
 21.56%  makedumpfile  makedumpfile       [.] readmem
  8.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
Output of "perf report" for the non-mmap case:

/* Time is spent for sys_read (which also needs two copy operations on s390 :() */
 25.32%  makedumpfile  [kernel.kallsyms]  [k] memcpy_real
 22.74%  makedumpfile  [kernel.kallsyms]  [k] __copy_to_user
/* readmem() is cheaper on the read path? */
 13.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
  4.53%  makedumpfile  makedumpfile       [.] readmem
$ time ./makedumpfile -d 0 /proc/vmcore /dev/null -f
Output of "perf report" for the mmap case:

/* Almost no kernel time because we issue very few system calls */
  0.61%  makedumpfile  [kernel.kallsyms]  [k] unmap_single_vma
  0.61%  makedumpfile  [kernel.kallsyms]  [k] sysc_do_svc
/* Almost all time is consumed in user space */
 84.64%  makedumpfile  makedumpfile       [.] readmem
  8.82%  makedumpfile  makedumpfile       [.] write_cache
Output of "perf report" for the non-mmap case:

/* Time is spent for sys_read (which also needs two copy operations on s390) */
 31.50%  makedumpfile  [kernel.kallsyms]  [k] memcpy_real
 29.33%  makedumpfile  [kernel.kallsyms]  [k] __copy_to_user
/* Very little user space time */
  3.87%  makedumpfile  makedumpfile       [.] write_cache
  3.82%  makedumpfile  makedumpfile       [.] readmem
I hope this analysis helps more than it confuses :-)
As a conclusion, we could think about mapping larger chunks
also for the fragmented -d 31 case to reduce the number of
mmap/munmap calls.
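A rough sketch of that idea (hypothetical names; the real logic would
live in readmem()): keep one large window mapped and remap only on a
miss, so many small reads share a single mmap/munmap pair.

#include <sys/mman.h>
#include <sys/types.h>

#define MAP_CHUNK	(16UL << 20)	/* window size, tunable */

static char *win;			/* currently mapped window */
static off_t win_start = -1;		/* file offset the window covers */

/* Return a pointer to 'len' bytes at file offset 'off'.  The window
 * is remapped only on a miss, so the mmap/munmap cost is amortized
 * over many small reads.  A real implementation also needs a read()
 * fallback for requests that straddle a window boundary. */
static char *map_read(int fd, off_t off, size_t len)
{
	if (win_start < 0 || off < win_start ||
	    off + (off_t)len > win_start + (off_t)MAP_CHUNK) {
		if (win_start >= 0)
			munmap(win, MAP_CHUNK);
		win_start = off & ~((off_t)MAP_CHUNK - 1);
		win = mmap(NULL, MAP_CHUNK, PROT_READ, MAP_SHARED,
			   fd, win_start);
		if (win == MAP_FAILED) {
			win_start = -1;
			return NULL;
		}
	}
	return win + (off - win_start);
}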
Another open question was why the mmap case consumes more CPU
time in readmem() than the read case. Our theory is that the
first memory access is slower because the data is not yet in
the HW cache. In the mmap case, user space issues the first
access (the copy into the makedumpfile cache); in the read
case, the kernel issues the first access
(memcpy_real()/copy_to_user()). Therefore the cache miss is
accounted to user space for mmap and to the kernel for read.
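If that theory is right, the cost should follow whoever touches the
page first. One way to test it (my sketch, not part of the patch set)
would be to touch every page of the mapped region before readmem()
copies it and check whether the profile time moves from readmem()
into this loop:

#include <stddef.h>

/* Touch each page once so that the first-access cost (page fault plus
 * HW cache miss) is accounted to this loop in the perf profile rather
 * than to the copy in readmem(). */
static void pretouch(const volatile char *buf, size_t len, long pagesize)
{
	volatile char sink;
	size_t i;

	for (i = 0; i < len; i += pagesize)
		sink = buf[i];
	(void)sink;
}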
And last but not least, perhaps on s390 we could replace
the bounce buffer used for memcpy_real()/copy_to_user() with
a more intelligent solution.
Best Regards
Michael
On Fri, 6 Mar 2015 15:03:12 +0100
Petr Tesarik <ptesarik at suse.cz> wrote:
> Because all pages must go into the cache, data is unnecessarily
> copied from mmapped regions to the cache. Avoid this copying by
> storing the mmapped regions directly in the cache.
>
> First, the cache code needs a cleanup and a clarification of the
> concept, especially the meaning of the pending list (allocated
> cache entries whose content is not yet valid).
>
> Second, the cache must be able to handle differently sized objects
> so that it can store individual pages as well as mmapped regions.
>
> Last, the cache eviction code must be extended to allow either
> reusing the read buffer or unmapping the region.
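
(To make the eviction idea concrete: each entry could carry a
per-entry discard callback, roughly as in the sketch below. This is
my reading of the concept, not necessarily the exact struct from the
patches.)

/* Each cache entry knows how to dispose of its own buffer. */
struct cache_entry {
	unsigned long long paddr;	/* physical address of the data  */
	void *bufptr;			/* read buffer or mmapped region */
	unsigned long buflen;		/* entry size, no longer fixed   */
	void (*discard)(struct cache_entry *);
};

static void cache_evict(struct cache_entry *ce)
{
	if (ce->discard)
		ce->discard(ce);	/* e.g. munmap() the region */
	/* otherwise it is a plain read buffer that can be reused */
}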
>
> Changelog:
> v2: cache cleanup _and_ actual mmap implementation
> v1: only the cache cleanup
>
> Petr Tesarik (8):
> cache: get rid of search loop in cache_add()
> cache: allow to return a page to the pool
> cache: do not allocate from the pending list
> cache: add hit/miss statistics to the final report
> cache: allocate buffers in one big chunk
> cache: allow arbitrary size of cache entries
> cache: store mapped regions directly in the cache
> cleanup: remove unused page_is_fractional
>
> cache.c | 81 +++++++++++++++++----------------
> cache.h | 16 +++++--
> elf_info.c | 16 -------
> elf_info.h | 2 -
> makedumpfile.c | 138 ++++++++++++++++++++++++++++++++++-----------------------
> 5 files changed, 138 insertions(+), 115 deletions(-)
>