[PATCH v2 0/8] Handle mmaped regions in cache [more analysis]
Atsushi Kumagai
ats-kumagai at wm.jp.nec.com
Sun Mar 15 22:14:25 PDT 2015
>On Fri, 13 Mar 2015 04:10:22 +0000
>Atsushi Kumagai <ats-kumagai at wm.jp.nec.com> wrote:
>
>> Hello,
>>
>> (Note: my email address has changed.)
>>
>> On x86_64, calling ioremap/iounmap per page in copy_oldmem_page()
>> causes a big performance degradation, so mmap() support was introduced
>> on /proc/vmcore. However, there is no big difference between read() and
>> mmap() on s390 since it doesn't need ioremap/iounmap in copy_oldmem_page(),
>> so other issues have been revealed, right?
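>>
>> For illustration, here is a minimal sketch of the two userspace access
>> patterns being compared (hypothetical helpers, not makedumpfile's
>> actual readmem()):
>>
>>     #include <sys/mman.h>
>>     #include <unistd.h>
>>
>>     #define PAGE_SZ 4096UL
>>
>>     /* read() path: one syscall per page; on x86_64 the kernel-side
>>      * copy_oldmem_page() used to do an ioremap/iounmap pair per page. */
>>     static int read_page(int fd, off_t off, void *buf)
>>     {
>>             return pread(fd, buf, PAGE_SZ, off) == (ssize_t)PAGE_SZ ? 0 : -1;
>>     }
>>
>>     /* mmap() path: one mmap() covers a whole chunk, later accesses are
>>      * plain loads, so the per-page kernel round trip disappears. */
>>     static void *map_chunk(int fd, off_t off, size_t len)
>>     {
>>             void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
>>             return p == MAP_FAILED ? NULL : p;
>>     }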
>>
>> [...]
>>
>> >> I counted the mmap and read system calls with "perf stat":
>> >>
>> >>                  mmap  unmap    read  =    sum
>> >> ===============================================
>> >> mmap -d0          482    443     165      1090
>> >> mmap -d31       13454  13414     165     27033
>> >> non-mmap -d0       34      3  458917    458954
>> >> non-mmap -d31      34      3   74273     74310
>> >
>> >If your VM has 1.5 GiB of RAM, then the numbers for -d0 look
>> >reasonable. For -d31, we should be able to do better than this
>> >by allocating more cache slots and improving the algorithm.
>> >I originally didn't deem it worth the effort, but seeing almost
>> >30 times more mmaps than with -d0 may change my mind.
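>> >
>> >To make the idea concrete, "allocating more cache slots" could look
>> >something like the following minimal sketch: a small fully-associative
>> >page cache with LRU eviction, so re-reads of recently seen pages avoid
>> >another mmap/munmap round trip (hypothetical code, not makedumpfile's
>> >actual cache):
>> >
>> >    #include <stdint.h>
>> >
>> >    #define NR_SLOTS 16           /* the tuning knob discussed above */
>> >    #define PAGE_SZ  4096UL
>> >
>> >    struct slot {
>> >            uint64_t paddr;       /* page-aligned address, 0 = empty */
>> >            uint64_t last_used;   /* LRU timestamp */
>> >            char     data[PAGE_SZ];
>> >    };
>> >
>> >    static struct slot cache[NR_SLOTS];
>> >    static uint64_t tick;
>> >
>> >    /* Return the slot for paddr; on a miss, evict the LRU slot and
>> >     * let the caller populate the returned buffer. */
>> >    static char *cache_lookup(uint64_t paddr, int *miss)
>> >    {
>> >            int i, victim = 0;
>> >
>> >            for (i = 0; i < NR_SLOTS; i++) {
>> >                    if (cache[i].paddr == paddr) {
>> >                            cache[i].last_used = ++tick;
>> >                            *miss = 0;
>> >                            return cache[i].data;
>> >                    }
>> >                    if (cache[i].last_used < cache[victim].last_used)
>> >                            victim = i;
>> >            }
>> >            cache[victim].paddr = paddr;
>> >            cache[victim].last_used = ++tick;
>> >            *miss = 1;
>> >            return cache[victim].data;
>> >    }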
>>
>> Are you going to do it as a v3 patch?
>
>No. Tuning the caching algorithm requires a lot of research. I plan to
>do it, but testing it with all scenarios (and tuning the algorithm
>based on the results) will probably take weeks. I don't think it makes
>sense to wait for it.
>
>> I'm going to release v1.5.8 soon, so I'll adopt the v2 patch if
>> you don't plan to update it.
>
>Since v2 already brings some performance gain, I would appreciate it
>if you could adopt it for v1.5.8.
OK, but unfortunately I got some error logs like the one below during my test:
$ ./makedumpfile -d31 /tmp/vmcore ./dumpfile.d31
Excluding free pages : [ 0.0 %] /
reset_bitmap_of_free_pages: The free list is broken.
reset_bitmap_of_free_pages: The free list is broken.
makedumpfile Failed.
$
All of the errors were the same as the above, at least in my test.
I identified [PATCH v2 7/8] as the cause by git bisect,
but the root cause is still under investigation.
        if (!readmem(VADDR, curr + OFFSET(list_head.prev),
                     &curr_prev, sizeof curr_prev)) { // get wrong value here
                ERRMSG("Can't get prev list_head.\n");
                return FALSE;
        }
        if (previous != curr_prev) {
                ERRMSG("The free list is broken.\n");
                retcd = ANALYSIS_FAILED;
                return FALSE;
        }
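
For reference, this is the invariant the check enforces, shown as a
standalone sketch (hypothetical types, not the kernel's struct
list_head): while walking a circular doubly linked list forward, each
node's prev pointer must point back at the node just visited, so a
stale or wrong cached read of list_head.prev trips exactly this test:

    #include <stdio.h>

    struct node { struct node *next, *prev; };

    static int walk_free_list(struct node *head)
    {
            struct node *previous = head;
            struct node *curr;

            for (curr = head->next; curr != head; curr = curr->next) {
                    if (curr->prev != previous) {
                            /* a bad read of prev lands here */
                            fprintf(stderr, "The free list is broken.\n");
                            return 0;
                    }
                    previous = curr;
            }
            return 1;
    }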
Thanks
Atsushi Kumagai
>Thank you very much,
>Petr Tesarik
>
>> Thanks
>> Atsushi Kumagai
>>
>> >
>> >> Here the actual results I got with "perf record":
>> >>
>> >> $ time ./makedumpfile -d 31 /proc/vmcore /dev/null -f
>> >>
>> >> Output of "perf report" for mmap case:
>> >>
>> >> /* Most time spent for unmap in kernel */
>> >> 29.75% makedumpfile [kernel.kallsyms] [k] unmap_single_vma
>> >> 9.84% makedumpfile [kernel.kallsyms] [k] remap_pfn_range
>> >> 8.49% makedumpfile [kernel.kallsyms] [k] vm_normal_page
>> >>
>> >> /* Still some mmap overhead in makedumpfile readmem() */
>> >> 21.56% makedumpfile makedumpfile [.] readmem
>> >
>> >This number is interesting. Did you compile makedumpfile with
>> >optimizations? If yes, then this number probably includes some
>> >functions which were inlined.
>> >
>> >>  8.49% makedumpfile makedumpfile      [.] write_kdump_pages_cyclic
>> >>
>> >> Output of "perf report" for non-mmap case:
>> >>
>> >> /* Time spent for sys_read (which also needs two copy operations on s390 :() */
>> >> 25.32% makedumpfile [kernel.kallsyms] [k] memcpy_real
>> >> 22.74% makedumpfile [kernel.kallsyms] [k] __copy_to_user
>> >>
>> >> /* readmem() for read path is cheaper ? */
>> >> 13.49% makedumpfile makedumpfile      [.] write_kdump_pages_cyclic
>> >>  4.53% makedumpfile makedumpfile      [.] readmem
>> >
>> >Yes, the much lower overhead of readmem is strange. For a moment I
>> >suspected wrong accounting of the page fault handler, but then I
>> >realized that for /proc/vmcore, all page table entries are created
>> >with the present bit set already, so there are no page faults...
>> >
>> >I haven't had time yet to set up a system for reproduction, but I'll
>> >try to identify what's eating up the CPU time in readmem().
>> >
>> >>[...]
>> >> I hope this analysis helps more than it confuses :-)
>> >>
>> >> As a conclusion, we could think of mapping larger chunks
>> >> also for the fragmented case of -d 31 to reduce the amount
>> >> of mmap/munmap calls.
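>> >>
>> >> A minimal sketch of that idea (assumed interface, not actual
>> >> makedumpfile code): merge adjacent or near-adjacent page ranges
>> >> before mapping, so the fragmented -d 31 case issues far fewer
>> >> mmap/munmap calls at the cost of mapping a few unneeded pages:
>> >>
>> >>     #include <stdint.h>
>> >>
>> >>     #define PAGE_SZ 4096ULL
>> >>     #define GAP_MAX (16 * PAGE_SZ) /* tolerate small holes per mapping */
>> >>
>> >>     struct range { uint64_t start, end; }; /* [start, end), sorted */
>> >>
>> >>     /* Coalesce a sorted range list in place; returns the new count. */
>> >>     static int coalesce(struct range *r, int n)
>> >>     {
>> >>             int i, out = 0;
>> >>
>> >>             for (i = 1; i < n; i++) {
>> >>                     if (r[i].start - r[out].end <= GAP_MAX)
>> >>                             r[out].end = r[i].end; /* extend mapping */
>> >>                     else
>> >>                             r[++out] = r[i];       /* start new one */
>> >>             }
>> >>             return n ? out + 1 : 0;
>> >>     }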
>> >
>> >I agree in general. Memory mapped through /proc/vmcore does not
>> >increase run-time memory requirements, because it only adds a mapping
>> >to the old kernel's memory. The only limiting factor is the virtual
>> >address space. On many architectures, this is no issue at all, and we
>> >could simply map the whole file at the beginning. On some architectures,
>> >the virtual address space is smaller than possible physical RAM, so
>> >this approach would not work for them.
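>> >
>> >One way to handle both cases is to try the single big mapping and fall
>> >back to chunked mappings when the virtual address space is too small;
>> >a sketch under those assumptions (assumed helpers, not actual
>> >makedumpfile code):
>> >
>> >    #include <sys/mman.h>
>> >
>> >    static void *map_whole_or_chunked(int fd, size_t filesz,
>> >                                      size_t *mapped_len)
>> >    {
>> >            void *p = mmap(NULL, filesz, PROT_READ, MAP_PRIVATE, fd, 0);
>> >
>> >            if (p != MAP_FAILED) {
>> >                    *mapped_len = filesz;  /* whole-file mapping worked */
>> >                    return p;
>> >            }
>> >            /* ENOMEM here typically means VA space exhaustion: fall
>> >             * back to a bounded chunk and remap as the cursor moves. */
>> >            *mapped_len = 64UL << 20;      /* e.g. 64 MiB chunks */
>> >            return mmap(NULL, *mapped_len, PROT_READ, MAP_PRIVATE, fd, 0);
>> >    }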
>> >
>> >> Another open question was why the mmap case consumes more CPU
>> >> time in readmem() than the read case. Our theory is that the
>> >> first memory access is slower because it is not in the HW
>> >> cache. For the mmap case userspace issues the first access (copy
>> >> to the makedumpfile cache) and for the read case the kernel issues
>> >> the first access (memcpy_real/copy_to_user). Therefore the
>> >> cache miss is accounted to userspace for mmap and to kernel for
>> >> read.
>> >
>> >I have no idea how to measure this on s390. On x86_64 I would add
>> >some asm code to read TSC before and after the memory access
>> >instruction. I guess there is a similar counter on s390. Suggestions?
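>> >
>> >On x86_64 it could look something like this (an illustrative sketch
>> >only; __rdtscp is the compiler intrinsic for RDTSCP, which waits for
>> >earlier instructions to finish before reading the TSC):
>> >
>> >    #include <stdint.h>
>> >    #include <x86intrin.h>
>> >
>> >    /* Time the first touch of a page; the delta attributes the
>> >     * cache/TLB miss cost to this access. */
>> >    static inline uint64_t time_first_access(volatile char *p)
>> >    {
>> >            unsigned int aux;
>> >            uint64_t t0, t1;
>> >
>> >            t0 = __rdtscp(&aux);
>> >            (void)*p;              /* the memory access being measured */
>> >            t1 = __rdtscp(&aux);
>> >            return t1 - t0;
>> >    }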
>> >
>> >> And last but not least, perhaps on s390 we could replace
>> >> the bounce buffer used for memcpy_real()/copy_to_user() by
>> >> some more intelligent solution.
>> >
>> >Which would then improve the non-mmap times even more, right?
>> >
>> >Petr T