[PATCH v2 0/8] Handle mmaped regions in cache [more analysis]
holzheu at linux.vnet.ibm.com
Fri Mar 13 09:19:57 PDT 2015
On Thu, 12 Mar 2015 16:38:22 +0100
Petr Tesarik <ptesarik at suse.cz> wrote:
> On Mon, 9 Mar 2015 17:08:58 +0100
> Michael Holzheu <holzheu at linux.vnet.ibm.com> wrote:
> > I counted the mmap and read system calls with "perf stat":
> >                 mmap  unmap    read      sum
> > =============================================
> > mmap -d0         482    443     165     1090
> > mmap -d31      13454  13414     165    27033
> > non-mmap -d0      34      3  458917   458954
> > non-mmap -d31     34      3   74273    74310
> If your VM has 1.5 GiB of RAM, then the numbers for -d0 look
I have 1792 MiB RAM.
> For -d31, we should be able to do better than this
> by allocating more cache slots and improving the algorithm.
> I originally didn't deem it worth the effort, but seeing almost
> 30 times more mmaps than with -d0 may change my mind.
> > Here the actual results I got with "perf record":
> > $ time ./makedumpfile -d 31 /proc/vmcore /dev/null -f
> > Output of "perf report" for mmap case:
> > /* Most time spent for unmap in kernel */
> > 29.75% makedumpfile [kernel.kallsyms] [k] unmap_single_vma
> > 9.84% makedumpfile [kernel.kallsyms] [k] remap_pfn_range
> > 8.49% makedumpfile [kernel.kallsyms] [k] vm_normal_page
> > /* Still some mmap overhead in makedumpfile readmem() */
> > 21.56% makedumpfile makedumpfile [.] readmem
> This number is interesting. Did you compile makedumpfile with
> optimizations? If yes, then this number probably includes some
> functions which were inlined.
Yes, I used the default Makefile (-O2) so most functions are inlined.
With -O0 I get the following:
15.35% makedumpfile libc-2.15.so [.] memcpy
2.14% makedumpfile makedumpfile [.] __exclude_unnecessary_pages
1.82% makedumpfile makedumpfile [.] test_bit
1.82% makedumpfile makedumpfile [.] set_bitmap_cyclic
1.32% makedumpfile makedumpfile [.] clear_bit_on_2nd_bitmap
1.32% makedumpfile makedumpfile [.] write_kdump_pages_cyclic
1.01% makedumpfile makedumpfile [.] is_on
0.88% makedumpfile makedumpfile [.] paddr_to_offset
0.75% makedumpfile makedumpfile [.] is_dumpable_cyclic
0.69% makedumpfile makedumpfile [.] exclude_range
0.63% makedumpfile makedumpfile [.] clear_bit_on_2nd_bitmap_for_kernel
0.63% makedumpfile [vdso] [.] __kernel_gettimeofday
0.57% makedumpfile makedumpfile [.] print_progress
0.50% makedumpfile makedumpfile [.] cache_search
> > 8.49% makedumpfile makedumpfile [.] write_kdump_pages_cyclic
> > Output of "perf report" for non-mmap case:
> > /* Time spent in sys_read (which on s390 also needs two copy operations :() */
> > 25.32% makedumpfile [kernel.kallsyms] [k] memcpy_real
> > 22.74% makedumpfile [kernel.kallsyms] [k] __copy_to_user
> > /* readmem() is cheaper on the read path? */
> > 13.49% makedumpfile makedumpfile [.] write_kdump_pages_cyclic
> > 4.53% makedumpfile makedumpfile [.] readmem
> Yes, much lower overhead of readmem is strange. For a moment I
> suspected wrong accounting of the page fault handler, but then I
> realized that for /proc/vmcore, all page table entries are created
> with the present bit set already, so there are no page faults...
Right, as said below, perhaps it is the HW caching issue.
> I haven't had time yet to set up a system for reproduction, but I'll
> try to identify what's eating up the CPU time in readmem().
> > I hope this analysis helps more than it confuses :-)
> > As a conclusion, we could think of mapping larger chunks
> > also for the fragmented case of -d 31 to reduce the amount
> > of mmap/munmap calls.
> I agree in general. Memory mapped through /proc/vmcore does not
> increase run-time memory requirements, because it only adds a mapping
> to the old kernel's memory.
At least you need the page table memory for the /proc/vmcore mappings.
> The only limiting factor is the virtual
> address space. On many architectures, this is no issue at all, and we
> could simply map the whole file at beginning. On some architectures,
> the virtual address space is smaller than possible physical RAM, so
> this approach would not work for them.
> > Another open question was why the mmap case consumes more CPU
> > time in readmem() than the read case. Our theory is that the
> > first memory access is slower because it is not in the HW
> > cache. For the mmap case userspace issues the first access (copy
> > to the makedumpfile cache) and for the read case the kernel issues
> > the first access (memcpy_real/copy_to_user). Therefore the
> > cache miss is accounted to userspace for mmap and to kernel for
> > read.
> I have no idea how to measure this on s390. On x86_64 I would add some
> asm code to read TSC before and after the memory access instruction. I
> guess there is a similar counter on s390. Suggestions?
On s390 under LPAR we have hardware counters for cache misses:
# perf stat -e cpum_cf/L1D_PENALTY_CYCLES/,cpum_cf/PROBLEM_STATE_L1D_PENALTY_CYCLES/ ./makedumpfile -d31 /proc/vmcore /dev/null -f
Performance counter stats for './makedumpfile -d31 /proc/vmcore /dev/null -f':
# perf stat -e cpum_cf/L1D_PENALTY_CYCLES/,cpum_cf/PROBLEM_STATE_L1D_PENALTY_CYCLES/ ./makedumpfile -d31 /proc/vmcore /dev/null -f --non-mmap
Performance counter stats for './makedumpfile -d31 /proc/vmcore /dev/null -f --non-mmap':
- L1D_PENALTY_CYCLES: Cycles wasted due to L1 cache misses (kernel + userspace)
- PROBLEM_STATE_L1D_PENALTY_CYCLES: Cycles wasted due to L1 cache misses (userspace only)
So if I got it right, we see that for the mmap() case the cache
misses are almost all in userspace, and for the read() case they
are in the kernel.
Interestingly, on that machine (4 GiB, LPAR, and a newer model)
mmap() was also faster for -d 31:
$ time ./makedumpfile /proc/vmcore -d 31 /dev/null -f
$ time ./makedumpfile /proc/vmcore -d 31 /dev/null -f --non-mmap
> > And last but not least, perhaps on s390 we could replace
> > the bounce buffer used for memcpy_real()/copy_to_user() with
> > some more intelligent solution.
> Which would then improve the non-mmap times even more, right?