[PATCH RFC 00/11] makedumpfile: parallel processing

Tue Dec 22 00:32:25 PST 2015

Chao,

From: Chao Fan <cfan at redhat.com>
Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
Date: Thu, 10 Dec 2015 05:54:28 -0500

> 
> 
> ----- Original Message -----
>> From: "Wenjian Zhou/周文剑" <zhouwj-fnst at cn.fujitsu.com>
>> To: "Chao Fan" <cfan at redhat.com>
>> Cc: "Atsushi Kumagai" <ats-kumagai at wm.jp.nec.com>, kexec at lists.infradead.org
>> Sent: Thursday, December 10, 2015 6:32:32 PM
>> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
>> 
>> On 12/10/2015 05:58 PM, Chao Fan wrote:
>> >
>> >
>> > ----- Original Message -----
>> >> From: "Wenjian Zhou/周文剑" <zhouwj-fnst at cn.fujitsu.com>
>> >> To: "Atsushi Kumagai" <ats-kumagai at wm.jp.nec.com>
>> >> Cc: kexec at lists.infradead.org
>> >> Sent: Thursday, December 10, 2015 5:36:47 PM
>> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
>> >>
>> >> On 12/10/2015 04:14 PM, Atsushi Kumagai wrote:
>> >>>> Hello Kumagai,
>> >>>>
>> >>>> On 12/04/2015 10:30 AM, Atsushi Kumagai wrote:
>> >>>>> Hello, Zhou
>> >>>>>
>> >>>>>> On 12/02/2015 03:24 PM, Dave Young wrote:
>> >>>>>>> Hi,
>> >>>>>>>
>> >>>>>>> On 12/02/15 at 01:29pm, "Zhou, Wenjian/周文剑" wrote:
>> >>>>>>>> I think there is no problem if other test results are as expected.
>> >>>>>>>>
>> >>>>>>>> --num-threads mainly reduces the time spent compressing.
>> >>>>>>>> So for lzo, it can't help much most of the time.
>> >>>>>>>
>> >>>>>>> Seems the help of --num-threads does not say it exactly:
>> >>>>>>>
>> >>>>>>>       [--num-threads THREADNUM]:
>> >>>>>>>           Using multiple threads to read and compress data of each
>> >>>>>>>           page
>> >>>>>>>           in parallel.
>> >>>>>>>           And it will reduces time for saving DUMPFILE.
>> >>>>>>>           This feature only supports creating DUMPFILE in
>> >>>>>>>           kdump-comressed format from
>> >>>>>>>           VMCORE in kdump-compressed format or elf format.
>> >>>>>>>
>> >>>>>>> Lzo is also a compress method, it should be mentioned that
>> >>>>>>> --num-threads only
>> >>>>>>> supports zlib compressed vmcore.
>> >>>>>>>
>> >>>>>>
>> >>>>>> Sorry, it seems that something I said is not so clear.
>> >>>>>> lzo is also supported. Since lzo compresses data at a high speed, the
>> >>>>>> performance improvement is not so obvious most of the time.
>> >>>>>>
>> >>>>>>> Also worth to mention about the recommended -d value for this
>> >>>>>>> feature.
>> >>>>>>>
>> >>>>>>
>> >>>>>> Yes, I think it's worth. I forgot it.
>> >>>>>
>> >>>>> I saw your patch, but I think I should confirm what is the problem
>> >>>>> first.
>> >>>>>
>> >>>>>> However, when "-d 31" is specified, it will be worse.
>> >>>>>> Less than 50 buffers are used to cache the compressed page.
>> >>>>>> And even the page has been filtered, it will also take a buffer.
>> >>>>>> So if "-d 31" is specified, the filtered page will use a lot
>> >>>>>> of buffers. Then the page which needs to be compressed can't
>> >>>>>> be compressed parallel.
>> >>>>>
>> >>>>> Could you explain why compression will not be parallel in more detail ?
>> >>>>> Actually the buffers are used also for filtered pages, it sounds
>> >>>>> inefficient.
>> >>>>> However, I don't understand why it prevents parallel compression.
>> >>>>>
>> >>>>
>> >>>> Think about this, in a huge memory, most of the page will be filtered,
>> >>>> and
>> >>>> we have 5 buffers.
>> >>>>
>> >>>> page1       page2      page3     page4     page5      page6       page7     .....
>> >>>> [buffer1]   [2]        [3]       [4]       [5]
>> >>>> unfiltered  filtered   filtered  filtered  filtered   unfiltered  filtered
>> >>>> Since a filtered page also takes a buffer, page6 can't be
>> >>>> compressed at the same time as page1.
>> >>>> That is why it prevents parallel compression.
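
The buffer example above can be reproduced with a toy model. This is only
a sketch of the reasoning, not makedumpfile code: max_parallel() and its
parameters are made-up names, and the model simply assumes that buffers
are handed to consecutive pages in order.

```python
def max_parallel(pages, nbuf, filtered_takes_buffer):
    """Largest number of unfiltered pages that hold a buffer at once.

    pages: list of bool in page order, True = unfiltered (needs compression).
    nbuf: number of buffers, each assigned to one page of a sliding window.
    filtered_takes_buffer: True models the behaviour described above,
    False models the proposal that only unfiltered pages take a buffer.
    """
    if not filtered_takes_buffer:
        # Filtered pages are skipped before a buffer is assigned.
        pages = [p for p in pages if p]
    best = 0
    for start in range(len(pages)):
        # Pages start .. start+nbuf-1 occupy the buffers at this moment.
        best = max(best, sum(pages[start:start + nbuf]))
    return best


# page1 .. page7 from the example: only page1 and page6 are unfiltered.
EXAMPLE = [True, False, False, False, False, True, False]

if __name__ == "__main__":
    print(max_parallel(EXAMPLE, 5, True))
    print(max_parallel(EXAMPLE, 5, False))
```

With 5 buffers, the first call never sees page1 and page6 in the same
window, while the second one does.
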
>> >>>
>> >>> Thanks for your explanation, I understand.
>> >>> This is just an issue of the current implementation; there is no
>> >>> reason to accept this restriction.
>> >>>
>> >>>>> Further, according to Chao's benchmark, there is a big performance
>> >>>>> degradation even if the number of thread is 1. (58s vs 240s)
>> >>>>> The current implementation seems to have some problems, we should
>> >>>>> solve them.
>> >>>>>
>> >>>>
>> >>>> If "-d 31" is specified, on the one hand we can't save time by
>> >>>> compressing
>> >>>> parallel, on the other hand we will introduce some extra work by adding
>> >>>> "--num-threads". So it is obvious that it will have a performance
>> >>>> degradation.
>> >>>
>> >>> Sure, there must be some overhead due to "some extra work"(e.g. exclusive
>> >>> lock),
>> >>> but "--num-threads=1 is 4 times slower than --num-threads=0" still sounds
>> >>> too slow, the degradation is too big to be called "some extra work".
>> >>>
>> >>> Both --num-threads=0 and --num-threads=1 are serial processing,
>> >>> the above "buffer fairness issue" will not be related to this
>> >>> degradation.
>> >>> What do you think causes this degradation?
>> >>>
>> >>
>> >> I can't get such result at this moment, so I can't do some further
>> >> investigation
>> >> right now. I guess it may be caused by the underlying implementation of
>> >> pthread.
>> >> I reviewed the test result of the patch v2 and found in different
>> >> machines,
>> >> the results are quite different.
>> >
>> > Hi Zhou Wenjian,
>> >
>> > I have done more tests in another machine with 128G memory, and get the
>> > result:
>> >
>> > the size of vmcore is 300M in "-d 31"
>> > makedumpfile -l --message-level 1 -d 31:
>> > time: 8.6s      page-faults: 2272
>> >
>> > makedumpfile -l --num-threads 1 --message-level 1 -d 31:
>> > time: 28.1s     page-faults: 2359
>> >
>> >
>> > and the size of vmcore is 2.6G in "-d 0".
>> > In this machine, I get the same result as yours:
>> >
>> >
>> > makedumpfile -c --message-level 1 -d 0:
>> > time: 597s      page-faults: 2287
>> >
>> > makedumpfile -c --num-threads 1 --message-level 1 -d 0:
>> > time: 602s      page-faults: 2361
>> >
>> > makedumpfile -c --num-threads 2 --message-level 1 -d 0:
>> > time: 337s      page-faults: 2397
>> >
>> > makedumpfile -c --num-threads 4 --message-level 1 -d 0:
>> > time: 175s      page-faults: 2461
>> >
>> > makedumpfile -c --num-threads 8 --message-level 1 -d 0:
>> > time: 103s      page-faults: 2611
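
For reference, the scaling in the "-c -d 0" numbers above can be worked
out as below. This is just arithmetic on the times quoted in this mail;
the function and variable names are made up for illustration.

```python
# Times in seconds, copied from the results quoted above.
BASELINE_C_D0 = 597.0                      # makedumpfile -c -d 0
THREADED_C_D0 = {1: 602.0, 2: 337.0, 4: 175.0, 8: 103.0}

BASELINE_L_D31 = 8.6                       # makedumpfile -l -d 31
THREADED_L_D31 = {1: 28.1}


def summarize(baseline, results):
    """Return {threads: (speedup, efficiency)} relative to the baseline.

    speedup    = baseline time / threaded time
    efficiency = speedup / number of threads
    """
    summary = {}
    for nthreads, seconds in results.items():
        speedup = baseline / seconds
        summary[nthreads] = (speedup, speedup / nthreads)
    return summary


if __name__ == "__main__":
    for name, base, res in (("-c -d 0", BASELINE_C_D0, THREADED_C_D0),
                            ("-l -d 31", BASELINE_L_D31, THREADED_L_D31)):
        for n, (s, e) in sorted(summarize(base, res).items()):
            print("%-9s threads=%d speedup=%.2f efficiency=%.2f"
                  % (name, n, s, e))
```

A speedup below 1 means the threaded run was slower than the run without
--num-threads, which is the case for "-l -d 31" here.
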
>> >
>> >
>> > But the machine of my first test is not under my control, should I wait for
>> > the first machine to do more tests?
>> > If there are still some problems in my tests, please tell me.
>> >
>> 
>> Thanks a lot for your test, it seems that there is nothing wrong.
>> And I haven't got any idea about more tests...
>> 
>> Could you provide the information of your cpu ?
>> I will do some further investigation later.
>> 
> 
> OK, of course, here is the information of cpu:
> 
> # lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                48
> On-line CPU(s) list:   0-47
> Thread(s) per core:    1
> Core(s) per socket:    6
> Socket(s):             8
> NUMA node(s):          8
> Vendor ID:             AuthenticAMD
> CPU family:            16
> Model:                 8
> Model name:            Six-Core AMD Opteron(tm) Processor 8439 SE
> Stepping:              0
> CPU MHz:               2793.040
> BogoMIPS:              5586.22
> Virtualization:        AMD-V
> L1d cache:             64K
> L1i cache:             64K
> L2 cache:              512K
> L3 cache:              5118K
> NUMA node0 CPU(s):     0,8,16,24,32,40
> NUMA node1 CPU(s):     1,9,17,25,33,41
> NUMA node2 CPU(s):     2,10,18,26,34,42
> NUMA node3 CPU(s):     3,11,19,27,35,43
> NUMA node4 CPU(s):     4,12,20,28,36,44
> NUMA node5 CPU(s):     5,13,21,29,37,45
> NUMA node6 CPU(s):     6,14,22,30,38,46
> NUMA node7 CPU(s):     7,15,23,31,39,47

This CPU assignment on NUMA nodes looks interesting. Is it possible
that it affects the performance of makedumpfile? This is just a guess.

Could you check whether the performance improves if you run all
threads on the same NUMA node? For example:

  # taskset -c 0,8,16,24 makedumpfile --num-threads 4 -c -d 0 vmcore vmcore-cd0
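
The CPU list for such a command can also be derived from the lscpu output
instead of being typed by hand. The following is a minimal sketch; the
helper names are hypothetical, and it only assumes the
"NUMA nodeN CPU(s):" line format shown in the lscpu output above.

```python
import re


def numa_node_cpus(lscpu_text):
    """Map NUMA node number -> list of CPU ids parsed from lscpu output."""
    nodes = {}
    for line in lscpu_text.splitlines():
        # Matches "NUMA node0 CPU(s): 0,8,16" but not "NUMA node(s): 8".
        m = re.match(r"\s*NUMA node(\d+) CPU\(s\):\s*(\S+)", line)
        if not m:
            continue
        cpus = []
        for part in m.group(2).split(","):
            if "-" in part:
                # lscpu may also print ranges such as "0-5".
                first, last = part.split("-")
                cpus.extend(range(int(first), int(last) + 1))
            else:
                cpus.append(int(part))
        nodes[int(m.group(1))] = cpus
    return nodes


def taskset_cpu_list(nodes, node, nthreads):
    """CPU list string for 'taskset -c', using CPUs of a single node."""
    return ",".join(str(cpu) for cpu in nodes[node][:nthreads])


SAMPLE = """\
NUMA node(s):          8
NUMA node0 CPU(s):     0,8,16,24,32,40
NUMA node1 CPU(s):     1,9,17,25,33,41
"""

if __name__ == "__main__":
    nodes = numa_node_cpus(SAMPLE)
    print("taskset -c %s makedumpfile --num-threads 4 -c -d 0 "
          "vmcore vmcore-cd0" % taskset_cpu_list(nodes, 0, 4))
```

For node 0 and 4 threads this yields the same CPU list as in the command
above.
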

Also, if this turns out to be the cause of the performance degradation,
we might need to extend the nr_cpus= kernel option so that we can choose
which NUMA nodes to use, though we might already be able to do that in
combination with other kernel features.

> Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate npt lbrv svm_lock nrip_save pausefilter vmmcall
> 
>> But I still believe it's better not to use "-l -d 31" and "--num-threads"
>> at the same time, though it's very strange that the performance
>> degradation is so big.
>> 
>> --
>> Thanks
>> Zhou
>> 
>> > Thanks,
>> > Chao Fan
>> >
>> >
>> >>
>> >> It seems that I can get almost the same result as Chao on "PRIMEQUEST
>> >> 1800E".
>> >>
>> >> ###################################
>> >> - System: PRIMERGY RX300 S6
>> >> - CPU: Intel(R) Xeon(R) CPU x5660
>> >> - memory: 16GB
>> >> ###################################
>> >> ************ makedumpfile -d 7 ******************
>> >>                   core-data       0       256
>> >>           threads-num
>> >> -l
>> >>           0                       10      144
>> >>           4                       5       110
>> >>           8                       5       111
>> >>           12                      6       111
>> >>
>> >> ************ makedumpfile -d 31 ******************
>> >>                   core-data       0       256
>> >>           threads-num
>> >> -l
>> >>           0                       0       0
>> >>           4                       2       2
>> >>           8                       2       3
>> >>           12                      2       3
>> >>
>> >> ###################################
>> >> - System: PRIMEQUEST 1800E
>> >> - CPU: Intel(R) Xeon(R) CPU E7540
>> >> - memory: 32GB
>> >> ###################################
>> >> ************ makedumpfile -d 7 ******************
>> >>                   core-data        0       256
>> >>           threads-num
>> >> -l
>> >>           0                        34      270
>> >>           4                        63      154
>> >>           8                        64      131
>> >>           12                       65      159
>> >>
>> >> ************ makedumpfile -d 31 ******************
>> >>                   core-data        0       256
>> >>           threads-num
>> >> -l
>> >>           0                        2       1
>> >>           4                        48      48
>> >>           8                        48      49
>> >>           12                       49      50
>> >>
>> >>>> I'm not sure whether such a big performance degradation is a problem.
>> >>>> But if it works as expected in other cases, I think this won't be a
>> >>>> problem (or at least not one that needs to be fixed), since some
>> >>>> performance degradation is expected in theory.
>> >>>>
>> >>>> Alternatively, the current implementation could be replaced by a new algorithm.
>> >>>> For example:
>> >>>> We can add an array to record whether the page is filtered or not.
>> >>>> And only the unfiltered page will take the buffer.
>> >>>
>> >>> We should discuss how to implement new mechanism, I'll mention this
>> >>> later.
>> >>>
>> >>>> But I'm not sure it is worth it.
>> >>>> Since "-l -d 31" is already fast enough, the new algorithm wouldn't
>> >>>> help much either.
>> >>>
>> >>> Basically the faster, the better. There is no obvious target time.
>> >>> If there is room for improvement, we should do it.
>> >>>
>> >>
>> >> Maybe we can improve the performance of "-c -d 31" in some case.
>> >>
>> >> BTW, we can easily get the theoretical performance by using the "--split".
>> >>
>> >> --
>> >> Thanks
>> >> Zhou
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> kexec mailing list
>> >> kexec at lists.infradead.org
>> >> http://lists.infradead.org/mailman/listinfo/kexec
>> >>
--
Thanks.
HATAYAMA, Daisuke