[PATCH v2 00/10] makedumpfile: parallel processing

Atsushi Kumagai ats-kumagai at wm.jp.nec.com
Wed Jul 8 01:10:52 PDT 2015


>Hello Atsushi Kumagai,
>
>I have tried a lot, and I think the big performance degradation only
>occurs on certain CPUs.
>I can think of two possible reasons, and I need your help to confirm
>which is the real one.
>
>The following tests can also be done using a dumpfile instead of /proc/vmcore.
>
>Test 1: determine whether it is caused by multi-threading.
>apply patch: test1
>command1: ./makedumpfile -c /proc/vmcore vmcore --num-threads 1
>command2: ./makedumpfile -c /proc/vmcore vmcore --num-threads 8
>
>It would be better to run some tests with -l too.
>command1: ./makedumpfile -l /proc/vmcore vmcore
>command2: ./makedumpfile -l /proc/vmcore vmcore --num-threads 1
>command3: ./makedumpfile -l /proc/vmcore vmcore --num-threads 8
>
>Test 2: determine whether it is caused by doing the compression in a thread.
>2.1:
>	apply patch: test2.1
>	command: ./makedumpfile -c /proc/vmcore vmcore --num-threads 1
>2.2:
>	apply patch: test2.2
>	command: ./makedumpfile -c /proc/vmcore vmcore --num-threads 1
>
>Thanks a lot.
>
>BTW, could you show me the cpu name, zlib version and glibc version ?

Here is my environment; this time I tested on the 1st kernel on a
physical machine:

  CPU:   Intel(R) Xeon(R) CPU E7- 8870  @ 2.40GHz (80 cores)
  zlib:  1.2.3-29
  glibc: 2.12-1.149

Here is a summary of the results; the full logs are attached:

Test1:
  | threads | compress time | exec time |
  |    1    |     82.75     |   95.21   |
  |    8    |   34.1~34.8   |   40.32   |

Test1(lzo):
  | threads | compress time | exec time |
  |    -    |       -       |   30.33   |
  |    1    |       -       |   33.10   |
  |    8    |       -       |   28.77   |

Test2-1:
  | threads | compress time | exec time |
  |    1    |     76.12     |   82.13   |

Test2-2:
  | threads | compress time | exec time |
  |    1    |     41.97     |   51.46   |

The results are better than on virtual machines, but still worse than
without creating threads.

Additionally, I collected performance profiles with perf for debugging, as below:

Test2-1:
 38.93%  makedumpfile-2-  libz.so.1.2.3                  [.] 0x0000000000006e30   // compress2(?)
  5.16%  makedumpfile-2-  [kernel.kallsyms]              [k] page_fault
  5.02%  makedumpfile-2-  [kernel.kallsyms]              [k] clear_page_c
  4.96%  makedumpfile-2-  libc-2.12.so                   [.] __memset_sse2
  4.64%  makedumpfile-2-  [kernel.kallsyms]              [k] tick_program_event
  3.77%  makedumpfile-2-  libz.so.1.2.3                  [.] adler32
  2.08%  makedumpfile-2-  libc-2.12.so                   [.] memcpy
  1.95%  makedumpfile-2-  [kernel.kallsyms]              [k] __alloc_pages_nodemask
  1.54%  makedumpfile-2-  [kernel.kallsyms]              [k] get_page_from_freelist
  1.53%  makedumpfile-2-  [kernel.kallsyms]              [k] tick_dev_program_event
  1.32%  makedumpfile-2-  [kernel.kallsyms]              [k] __mem_cgroup_commit_charge
  1.08%  makedumpfile-2-  [kernel.kallsyms]              [k] hrtimer_interrupt
  0.94%  makedumpfile-2-  [kernel.kallsyms]              [k] release_pages
  0.85%  makedumpfile-2-  [kernel.kallsyms]              [k] alloc_pages_vma
  ...

Test2-2:
 66.46%  makedumpfile-2-  libz.so.1.2.3                  [.] 0x000000000000564e   // compress2(?)
  6.71%  makedumpfile-2-  libz.so.1.2.3                  [.] adler32
  5.56%  makedumpfile-2-  libc-2.12.so                   [.] __memset_sse2
  3.44%  makedumpfile-2-  libc-2.12.so                   [.] memcpy
  2.85%  makedumpfile-2-  [kernel.kallsyms]              [k] tick_dev_program_event
  1.30%  makedumpfile-2-  libz.so.1.2.3                  [.] _tr_flush_block
  0.95%  makedumpfile-2-  [kernel.kallsyms]              [k] tick_program_event
  0.92%  makedumpfile-2-  libz.so.1.2.3                  [.] _tr_init
  0.69%  makedumpfile-2-  [kernel.kallsyms]              [k] hrtimer_interrupt
  0.63%  makedumpfile-2-  [kernel.kallsyms]              [k] page_fault
  0.34%  makedumpfile-2-  libpthread-2.12.so             [.] __lll_unlock_wake
  0.33%  makedumpfile-2-  libc-2.12.so                   [.] _int_malloc
  0.33%  makedumpfile-2-  libc-2.12.so                   [.] _int_free

In the case of Test2-1 (using pthread_create), more CPU time was spent in
kernel functions such as page_fault() than in Test2-2 (which calls
kdump_thread_function_cyclic() directly); this may be one of the causes of
the degradation. I haven't found the reason yet; this is just for your
information.


Thanks
Atsushi Kumagai

>--
>Thanks
>Zhou Wenjian
>
>On 06/30/2015 05:06 PM, Atsushi Kumagai wrote:
>>> On 06/26/2015 03:49 PM, Atsushi Kumagai wrote:
>>>> I attached 5 processors to the VM and I confirmed that all threads
>>>> consumed full cpu time by top(1) on the host:
>>>>
>>>>       PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>>>     17614 qemu      20   0 5792m 4.9g 5652 R 435.1 72.8  29:02.17 qemu-kvm
>>>>
>>>> So I think the performance must be improved...
>>>
>>> Since I can't get that result in all machines here, could you test it with the patch:time
>>> and show me the output?
>>> Using "./makedumpfile -c --num-threads 4 /proc/vmcore dumpfile" is OK.
>>>
>>> The attachment is the patch time.
>>
>> Here is the result:
>>
>> / # makedumpfile -c --num-threads 4 /proc/vmcore /mnt/dumpfile
>> Copying data                       : [100.0 %] |
>> Copying data                       : [100.0 %] \
>>
>> The dumpfile is saved to /mnt/dumpfile.
>>
>> makedumpfile Completed.
>> lock time: 310s935500us
>> write time: 3s970037us
>> hit time: 6s316043us
>> find time: 317s926654us
>> loop_time: 37s321800us
>> thread consume_time: 0s0us
>> thread timea: 0s0us
>> thread timeb: 0s0us
>> read_time[0]: 8s637011us
>> lock_current_time[0]: 0s284428us
>> found_time[0]: 60s366795us
>> lock_consumed_time[0]: 2s782596us
>> compress_time[0]: 301s427073us
>> read_time[1]: 8s435914us
>> lock_current_time[1]: 0s271680us
>> found_time[1]: 60s329026us
>> lock_consumed_time[1]: 2s849061us
>> compress_time[1]: 302s98620us
>> read_time[2]: 8s380550us
>> lock_current_time[2]: 0s270388us
>> found_time[2]: 60s209376us
>> lock_consumed_time[2]: 3s297574us
>> compress_time[2]: 301s486768us
>> read_time[3]: 8s550662us
>> lock_current_time[3]: 0s278997us
>> found_time[3]: 60s476702us
>> lock_consumed_time[3]: 3s49184us
>> compress_time[3]: 301s718390us
>> count1: 172
>> count2: 70921401
>> count3: 0
>> count4: 0
>> count5: 0
>> count6: 0
>> count7: 0
>> exec time: 380s125494us
>>
>>
>> BTW, I fixed a small mistake before testing like:
>>
>> - write_time = (write2.tv_sec - write1.tv_sec) * 1000000 + (write2.tv_usec - write1.tv_usec);
>> + write_time += (write2.tv_sec - write1.tv_sec) * 1000000 + (write2.tv_usec - write1.tv_usec);
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test1.log
Type: application/octet-stream
Size: 8195 bytes
Desc: test1.log
URL: <http://lists.infradead.org/pipermail/kexec/attachments/20150708/53d4328c/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test1-lzo.log
Type: application/octet-stream
Size: 6548 bytes
Desc: test1-lzo.log
URL: <http://lists.infradead.org/pipermail/kexec/attachments/20150708/53d4328c/attachment-0005.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test2-1.log
Type: application/octet-stream
Size: 404 bytes
Desc: test2-1.log
URL: <http://lists.infradead.org/pipermail/kexec/attachments/20150708/53d4328c/attachment-0006.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test2-2.log
Type: application/octet-stream
Size: 404 bytes
Desc: test2-2.log
URL: <http://lists.infradead.org/pipermail/kexec/attachments/20150708/53d4328c/attachment-0007.obj>

