[PATCH v2 0/8] Handle mmaped regions in cache [more analysis]

Thu Mar 12 08:38:22 PDT 2015

On Mon, 9 Mar 2015 17:08:58 +0100
Michael Holzheu <holzheu at linux.vnet.ibm.com> wrote:

> Hello Petr,
> 
> With your patches I now used "perf record" and "perf stat"
> to check where the CPU time is consumed for -d31 and -d0.
> 
> For -d31 the read case is better and for -d0 the mmap case
> is better.
> 
>[...]
> 
> As already said, we think the reason is that for -d0 we issue
> only a small number of mmap/munmap calls because the mmap
> chunks are larger than the read chunks.

This is very likely.

> For -d31 memory is fragmented and we issue lots of small
> mmap/munmap calls. Because munmap (at least on s390) is a
> very expensive operation and we need two calls (mmap/munmap),
> the mmap mode is slower than the read mode.

Yes. This may also explain why my patch set improves the situation.
By keeping the mmapped regions themselves in the cache, rather than
individual pages copied out of the mmap region, the cache is in
effect much larger, resulting in fewer mmap/munmap syscalls.
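
To illustrate the effect, here is a toy model (not the code from the
patch set). It assumes a fixed number of cache slots, round-robin
eviction and a purely sequential scan. The names region_cache,
cache_access and count_maps, as well as the slot count, are made up
for this example. Each miss stands for one mmap/munmap pair.

```c
#include <stdint.h>

#define SLOTS 4  /* hypothetical number of cache slots */

struct region_cache {
    uint64_t start[SLOTS];  /* file offset where each cached region begins */
    int used[SLOTS];        /* non-zero if the slot holds a region */
    unsigned next;          /* round-robin eviction cursor */
    unsigned long maps;     /* number of misses, i.e. new mappings needed */
};

/* Look up the region containing 'off'. Returns 1 on a hit, 0 on a miss.
 * On a miss, real code would munmap the evicted slot and mmap the new
 * region; this model only counts the event. */
int cache_access(struct region_cache *c, uint64_t off, uint64_t region_size)
{
    uint64_t start = off - off % region_size;
    unsigned i;

    for (i = 0; i < SLOTS; i++)
        if (c->used[i] && c->start[i] == start)
            return 1;

    c->start[c->next] = start;
    c->used[c->next] = 1;
    c->next = (c->next + 1) % SLOTS;
    c->maps++;
    return 0;
}

/* Scan 'npages' consecutive pages and report how many mappings it took. */
unsigned long count_maps(uint64_t region_size, uint64_t page_size,
                         unsigned npages)
{
    struct region_cache c = {0};
    unsigned p;

    for (p = 0; p < npages; p++)
        cache_access(&c, (uint64_t)p * page_size, region_size);
    return c.maps;
}
```

With the same four slots, count_maps(4096, 4096, 1024) misses on
every one of the 1024 pages, while count_maps(1 << 20, 4096, 1024)
needs only 4 mappings, one per 1 MiB region.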

> I counted the mmap and read system calls with "perf stat":
> 
>                    mmap  munmap    read     sum
>   =============================================
>   mmap -d0          482     443     165    1090
>   mmap -d31       13454   13414     165   27033
>   non-mmap -d0       34       3  458917  458954
>   non-mmap -d31      34       3   74273   74310

If your VM has 1.5 GiB of RAM, then the numbers for -d0 look
reasonable. For -d31, we should be able to do better than this
by allocating more cache slots and improving the algorithm.
I originally didn't deem it worth the effort, but seeing almost
30 times more mmaps than with -d0 may change my mind.

> Here the actual results I got with "perf record":
> 
> $ time ./makedumpfile  -d 31 /proc/vmcore  /dev/null -f
> 
>   Output of "perf report" for mmap case:
> 
>    /* Most time spent for unmap in kernel */
>    29.75%  makedumpfile  [kernel.kallsyms]  [k] unmap_single_vma
>     9.84%  makedumpfile  [kernel.kallsyms]  [k] remap_pfn_range
>     8.49%  makedumpfile  [kernel.kallsyms]  [k] vm_normal_page
> 
>    /* Still some mmap overhead in makedumpfile readmem() */
>    21.56%  makedumpfile  makedumpfile       [.] readmem

This number is interesting. Did you compile makedumpfile with
optimizations? If yes, then this number probably includes some
functions which were inlined.

>     8.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
> 
>   Output of "perf report" for non-mmap case:
> 
>    /* Time spent for sys_read (that needs also two copy operations on s390 :() */
>    25.32%  makedumpfile  [kernel.kallsyms]  [k] memcpy_real
>    22.74%  makedumpfile  [kernel.kallsyms]  [k] __copy_to_user
> 
>    /* readmem() for read path is cheaper ? */
>    13.49%  makedumpfile  makedumpfile       [.] write_kdump_pages_cyclic
>     4.53%  makedumpfile  makedumpfile       [.] readmem

Yes, the much lower readmem() overhead is strange. For a moment I
suspected wrong accounting of the page fault handler, but then I
realized that for /proc/vmcore, all page table entries are created
with the present bit set already, so there are no page faults...

I haven't had time yet to set up a system for reproduction, but I'll
try to identify what's eating up the CPU time in readmem().

>[...]
> I hope this analysis helps more than it confuses :-)
> 
> As a conclusion, we could think of mapping larger chunks
> also for the fragmented case of -d 31 to reduce the amount
> of mmap/munmap calls.

I agree in general. Memory mapped through /proc/vmcore does not
increase run-time memory requirements, because it only adds a mapping
to the old kernel's memory. The only limiting factor is the virtual
address space. On many architectures, this is no issue at all, and we
could simply map the whole file at the beginning. On some
architectures, the virtual address space is smaller than the maximum
possible physical RAM, so this approach would not work for them.
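
A rough sketch of that trade-off, under stated assumptions:
choose_map_len() is a hypothetical policy helper that maps the whole
file when it fits into a caller-chosen virtual address budget and
falls back to a fixed window otherwise. map_range() is a thin wrapper
around mmap(2) that aligns the file offset down to a page boundary.
Neither function exists in makedumpfile under these names, and the
mapping flags are my assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical policy: how many bytes to map in one go. */
uint64_t choose_map_len(uint64_t file_size, uint64_t va_budget,
                        uint64_t window)
{
    if (file_size <= va_budget)
        return file_size;           /* everything fits: map it all */
    return window < va_budget ? window : va_budget;
}

/* Map 'len' bytes starting at 'offset' read-only.
 * Returns a pointer to the byte at 'offset', or NULL on failure.
 * '*base' and '*maplen' receive the values to pass to munmap() later. */
const void *map_range(int fd, off_t offset, size_t len,
                      void **base, size_t *maplen)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    off_t aligned = offset - (offset % pagesize);  /* mmap needs this */
    size_t slack = (size_t)(offset - aligned);
    void *p;

    p = mmap(NULL, len + slack, PROT_READ, MAP_PRIVATE, fd, aligned);
    if (p == MAP_FAILED)
        return NULL;

    *base = p;
    *maplen = len + slack;
    return (const char *)p + slack;
}
```

A caller would compute the length once with choose_map_len(), call
map_range() for each window, and release it with
munmap(base, maplen) before moving on to the next one.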

> Another open question was why the mmap case consumes more CPU
> time in readmem() than the read case. Our theory is that the
> first memory access is slower because it is not in the HW
> cache. For the mmap case userspace issues the first access (copy
> to makedumpfile cache) and for the read case the kernel issues
> the first access (memcpy_real/copy_to_user). Therefore the
> cache miss is accounted to userspace for mmap and to kernel for
> read.

I have no idea how to measure this on s390. On x86_64 I would add some
asm code to read the TSC before and after the memory access
instruction. I guess there is a similar counter on s390. Suggestions?
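
For the x86 side, a minimal sketch of what I mean. It uses the
compiler intrinsic __rdtsc() from <x86intrin.h> (GCC/Clang) instead of
hand-written asm, and on any other architecture it falls back to
clock_gettime(), which is an assumption made only to keep the example
portable. The function names read_ticks and time_first_access are
invented here. There is no serializing instruction around the load,
so out-of-order execution can blur a single measurement.

```c
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Read a free-running counter: the TSC on x86, nanoseconds elsewhere. */
uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Time one load from 'p'. The loaded byte is stored in '*value' and the
 * elapsed tick count is returned. 'volatile' keeps the compiler from
 * dropping or moving the load. */
uint64_t time_first_access(const volatile unsigned char *p,
                           unsigned char *value)
{
    uint64_t t0, t1;

    t0 = read_ticks();
    *value = *p;
    t1 = read_ticks();
    return t1 - t0;
}
```

Calling time_first_access() twice on the same address of a freshly
mapped page and comparing the two results would give a first hint of
how much of the cost is the initial cache miss.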

> And last but not least, perhaps on s390 we could replace
> the bounce buffer used for memcpy_real()/copy_to_user() by
> some more intelligent solution.

Which would then improve the non-mmap times even more, right?

Petr T