[PATCH RFC 00/11] makedumpfile: parallel processing

HATAYAMA Daisuke d.hatayama at jp.fujitsu.com
Wed Dec 23 19:22:28 PST 2015


From: Chao Fan <cfan at redhat.com>
Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
Date: Wed, 23 Dec 2015 21:20:48 -0500

> 
> 
> ----- Original Message -----
>> From: "HATAYAMA Daisuke" <d.hatayama at jp.fujitsu.com>
>> To: cfan at redhat.com
>> Cc: ats-kumagai at wm.jp.nec.com, zhouwj-fnst at cn.fujitsu.com, kexec at lists.infradead.org
>> Sent: Tuesday, December 22, 2015 4:32:25 PM
>> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
>> 
>> Chao,
>> 
>> From: Chao Fan <cfan at redhat.com>
>> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
>> Date: Thu, 10 Dec 2015 05:54:28 -0500
>> 
>> > 
>> > 
>> > ----- Original Message -----
>> >> From: "Wenjian Zhou/周文剑" <zhouwj-fnst at cn.fujitsu.com>
>> >> To: "Chao Fan" <cfan at redhat.com>
>> >> Cc: "Atsushi Kumagai" <ats-kumagai at wm.jp.nec.com>,
>> >> kexec at lists.infradead.org
>> >> Sent: Thursday, December 10, 2015 6:32:32 PM
>> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
>> >> 
>> >> On 12/10/2015 05:58 PM, Chao Fan wrote:
>> >> >
>> >> >
>> >> > ----- Original Message -----
>> >> >> From: "Wenjian Zhou/周文剑" <zhouwj-fnst at cn.fujitsu.com>
>> >> >> To: "Atsushi Kumagai" <ats-kumagai at wm.jp.nec.com>
>> >> >> Cc: kexec at lists.infradead.org
>> >> >> Sent: Thursday, December 10, 2015 5:36:47 PM
>> >> >> Subject: Re: [PATCH RFC 00/11] makedumpfile: parallel processing
>> >> >>
>> >> >> On 12/10/2015 04:14 PM, Atsushi Kumagai wrote:
>> >> >>>> Hello Kumagai,
>> >> >>>>
>> >> >>>> On 12/04/2015 10:30 AM, Atsushi Kumagai wrote:
>> >> >>>>> Hello, Zhou
>> >> >>>>>
>> >> >>>>>> On 12/02/2015 03:24 PM, Dave Young wrote:
>> >> >>>>>>> Hi,
>> >> >>>>>>>
>> >> >>>>>>> On 12/02/15 at 01:29pm, "Zhou, Wenjian/周文剑" wrote:
>> >> >>>>>>>> I think there is no problem if other test results are as
>> >> >>>>>>>> expected.
>> >> >>>>>>>>
>> >> >>>>>>>> --num-threads mainly reduces the time spent on compression.
>> >> >>>>>>>> So for lzo, it can't help much most of the time.
>> >> >>>>>>>
>> >> >>>>>>> It seems the help text for --num-threads does not say that exactly:
>> >> >>>>>>>
>> >> >>>>>>>       [--num-threads THREADNUM]:
>> >> >>>>>>>           Using multiple threads to read and compress data of
>> >> >>>>>>>           each page in parallel.
>> >> >>>>>>>           And it will reduce the time for saving DUMPFILE.
>> >> >>>>>>>           This feature only supports creating DUMPFILE in
>> >> >>>>>>>           kdump-compressed format from VMCORE in
>> >> >>>>>>>           kdump-compressed format or elf format.
>> >> >>>>>>>
>> >> >>>>>>> Lzo is also a compression method; it should be mentioned that
>> >> >>>>>>> --num-threads only supports zlib-compressed vmcores.
>> >> >>>>>>>
>> >> >>>>>>
>> >> >>>>>> Sorry, it seems that something I said was not so clear.
>> >> >>>>>> lzo is also supported. But since lzo compresses data at high
>> >> >>>>>> speed, the performance improvement is not so obvious most of
>> >> >>>>>> the time.
>> >> >>>>>>
>> >> >>>>>>> It's also worth mentioning the recommended -d value for this
>> >> >>>>>>> feature.
>> >> >>>>>>>
>> >> >>>>>>
>> >> >>>>>> Yes, I think it's worth mentioning. I forgot it.
>> >> >>>>>
>> >> >>>>> I saw your patch, but I think I should confirm what the problem
>> >> >>>>> is first.
>> >> >>>>>
>> >> >>>>>> However, when "-d 31" is specified, it gets worse.
>> >> >>>>>> Fewer than 50 buffers are used to cache the compressed pages,
>> >> >>>>>> and even a page that has been filtered takes a buffer.
>> >> >>>>>> So if "-d 31" is specified, the filtered pages use up a lot
>> >> >>>>>> of buffers, and the pages which need to be compressed can't
>> >> >>>>>> be compressed in parallel.
>> >> >>>>>
>> >> >>>>> Could you explain in more detail why compression will not be
>> >> >>>>> parallel?
>> >> >>>>> Using the buffers also for filtered pages sounds inefficient,
>> >> >>>>> but I don't understand why it prevents parallel compression.
>> >> >>>>>
>> >> >>>>
>> >> >>>> Think about this: with a huge amount of memory, most of the pages
>> >> >>>> will be filtered, and we have 5 buffers.
>> >> >>>>
>> >> >>>> page1       page2      page3     page4     page5      page6       page7  ...
>> >> >>>> [buffer1]   [2]        [3]       [4]       [5]
>> >> >>>> unfiltered  filtered   filtered  filtered  filtered   unfiltered  filtered
>> >> >>>>
>> >> >>>> Since a filtered page also takes a buffer, page6 can't be compressed
>> >> >>>> at the same time as page1.
>> >> >>>> That is why it prevents parallel compression.
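>> >> >>>>
>> >> >>>> To show only this idea, here is a simplified sketch (illustration
>> >> >>>> only, not the actual makedumpfile code):
>> >> >>>>
>> >> >>>>     /*
>> >> >>>>      * Simplified model: every page, filtered or not, occupies one
>> >> >>>>      * slot of the fixed-size buffer window, and the window advances
>> >> >>>>      * in page order.
>> >> >>>>      */
>> >> >>>>     #include <stdio.h>
>> >> >>>>
>> >> >>>>     #define NUM_BUFFERS 5
>> >> >>>>
>> >> >>>>     int main(void)
>> >> >>>>     {
>> >> >>>>         /* 1 = needs compression, 0 = filtered (typical -d 31 case) */
>> >> >>>>         int needs_compress[] = { 1, 0, 0, 0, 0, 1, 0 };
>> >> >>>>         int compressible = 0;
>> >> >>>>
>> >> >>>>         /* Fill the 5-slot window starting at page1. */
>> >> >>>>         for (int i = 0; i < NUM_BUFFERS; i++)
>> >> >>>>             if (needs_compress[i])
>> >> >>>>                 compressible++;
>> >> >>>>
>> >> >>>>         /*
>> >> >>>>          * Only page1 is compressible inside the window, so page6
>> >> >>>>          * cannot be handed to another thread until the window has
>> >> >>>>          * moved past the filtered pages 2-5.
>> >> >>>>          */
>> >> >>>>         printf("pages compressible in parallel: %d of %d slots\n",
>> >> >>>>                compressible, NUM_BUFFERS);
>> >> >>>>         return 0;
>> >> >>>>     }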
>> >> >>>
>> >> >>> Thanks for your explanation, I understand.
>> >> >>> This is just an issue of the current implementation; there is no
>> >> >>> reason to keep this restriction.
>> >> >>>
>> >> >>>>> Further, according to Chao's benchmark, there is a big performance
>> >> >>>>> degradation even if the number of threads is 1 (58s vs 240s).
>> >> >>>>> The current implementation seems to have some problems; we should
>> >> >>>>> solve them.
>> >> >>>>>
>> >> >>>>
>> >> >>>> If "-d 31" is specified, on the one hand we can't save time by
>> >> >>>> compressing in parallel, and on the other hand we introduce some
>> >> >>>> extra work by adding "--num-threads". So it is obvious that there
>> >> >>>> will be some performance degradation.
>> >> >>>
>> >> >>> Sure, there must be some overhead due to "some extra work" (e.g.
>> >> >>> exclusive locking), but "--num-threads=1 is 4 times slower than
>> >> >>> --num-threads=0" still sounds too slow; the degradation is too big
>> >> >>> to be called "some extra work".
>> >> >>>
>> >> >>> Both --num-threads=0 and --num-threads=1 are serial processing, so
>> >> >>> the above "buffer fairness issue" cannot be related to this
>> >> >>> degradation.
>> >> >>> What do you think causes this degradation?
>> >> >>>
>> >> >>
>> >> >> I can't reproduce such a result at this moment, so I can't do any
>> >> >> further investigation right now. I guess it may be caused by the
>> >> >> underlying implementation of pthread.
>> >> >> I reviewed the test results of patch v2 and found that the results
>> >> >> are quite different on different machines.
>> >> >
>> >> > Hi Zhou Wenjian,
>> >> >
>> >> > I have done more tests on another machine with 128G of memory, and got
>> >> > the following results:
>> >> >
>> >> > The size of the vmcore is 300M with "-d 31".
>> >> > makedumpfile -l --message-level 1 -d 31:
>> >> > time: 8.6s      page-faults: 2272
>> >> >
>> >> > makedumpfile -l --num-threads 1 --message-level 1 -d 31:
>> >> > time: 28.1s     page-faults: 2359
>> >> >
>> >> >
>> >> > And the size of the vmcore is 2.6G with "-d 0".
>> >> > On this machine, I got the same result as yours:
>> >> >
>> >> >
>> >> > makedumpfile -c --message-level 1 -d 0:
>> >> > time: 597s      page-faults: 2287
>> >> >
>> >> > makedumpfile -c --num-threads 1 --message-level 1 -d 0:
>> >> > time: 602s      page-faults: 2361
>> >> >
>> >> > makedumpfile -c --num-threads 2 --message-level 1 -d 0:
>> >> > time: 337s      page-faults: 2397
>> >> >
>> >> > makedumpfile -c --num-threads 4 --message-level 1 -d 0:
>> >> > time: 175s      page-faults: 2461
>> >> >
>> >> > makedumpfile -c --num-threads 8 --message-level 1 -d 0:
>> >> > time: 103s      page-faults: 2611
>> >> >
>> >> >
>> >> > But the machine from my first test is not under my control; should I
>> >> > wait for that machine to do more tests?
>> >> > If there are still some problems with my tests, please tell me.
>> >> >
>> >> 
>> >> Thanks a lot for your tests; it seems that there is nothing wrong.
>> >> And I haven't got any ideas for more tests yet...
>> >>
>> >> Could you provide the information about your CPU?
>> >> I will do some further investigation later.
>> >> 
>> > 
>> > OK, of course. Here is the CPU information:
>> > 
>> > # lscpu
>> > Architecture:          x86_64
>> > CPU op-mode(s):        32-bit, 64-bit
>> > Byte Order:            Little Endian
>> > CPU(s):                48
>> > On-line CPU(s) list:   0-47
>> > Thread(s) per core:    1
>> > Core(s) per socket:    6
>> > Socket(s):             8
>> > NUMA node(s):          8
>> > Vendor ID:             AuthenticAMD
>> > CPU family:            16
>> > Model:                 8
>> > Model name:            Six-Core AMD Opteron(tm) Processor 8439 SE
>> > Stepping:              0
>> > CPU MHz:               2793.040
>> > BogoMIPS:              5586.22
>> > Virtualization:        AMD-V
>> > L1d cache:             64K
>> > L1i cache:             64K
>> > L2 cache:              512K
>> > L3 cache:              5118K
>> > NUMA node0 CPU(s):     0,8,16,24,32,40
>> > NUMA node1 CPU(s):     1,9,17,25,33,41
>> > NUMA node2 CPU(s):     2,10,18,26,34,42
>> > NUMA node3 CPU(s):     3,11,19,27,35,43
>> > NUMA node4 CPU(s):     4,12,20,28,36,44
>> > NUMA node5 CPU(s):     5,13,21,29,37,45
>> > NUMA node6 CPU(s):     6,14,22,30,38,46
>> > NUMA node7 CPU(s):     7,15,23,31,39,47
>> 
>> This CPU assignment across NUMA nodes looks interesting. Is it possible
>> that this affects the performance of makedumpfile? This is just a guess.
>>
>> Could you check whether the performance gets improved if you run all the
>> threads on the same NUMA node? For example:
>> 
>>   # taskset -c 0,8,16,24 makedumpfile --num-threads 4 -c -d 0 vmcore vmcore-cd0
>> 
> Hi HATAYAMA,
> 
> I think your guess is right, but maybe your command has a small problem.
> 
> From my tests, NUMA did affect the performance, but not by much.
> The average time with CPUs on the same NUMA node:
> # taskset -c 0,8,16,24,32 makedumpfile --num-threads 4 -c -d 0 vmcore vmcore-cd0
> is 314s.
> The average time with CPUs on different NUMA nodes:
> # taskset -c 2,3,5,6,7 makedumpfile --num-threads 4 -c -d 0 vmcore vmcore-cd0
> is 354s.
>

Hmm, according to the previous discussion, what we should see here is
whether it affects the performance of makedumpfile with --num-threads 1
and -d 31. So you need to compare:

    # taskset -c 0,8 makedumpfile --num-threads 1 -c -d 31 vmcore vmcore-d31

with:

    # taskset -c 0 makedumpfile -c -d 31 vmcore vmcore-d31
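
To make the numbers comparable, it would also help to collect the elapsed
time and the page-fault count the same way for both runs, for example with
perf stat (just a suggestion; whatever you used before is fine too):

    # perf stat -e page-faults taskset -c 0,8 makedumpfile --num-threads 1 -c -d 31 vmcore vmcore-d31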

Also, I'm assuming that you've done these benchmarks in the normal 1st
kernel, not in the kdump 2nd kernel. Is this correct?
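
One more idea, in case the NUMA effect turns out to matter: it might be
worth binding the memory allocation to the same node as the CPUs, e.g.:

    # numactl --cpunodebind=0 --membind=0 makedumpfile --num-threads 1 -c -d 31 vmcore vmcore-d31

This is just an additional suggestion, not something I have measured myself.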

> But I think if you want to use "--num-threads 4", the CPU list given to
> "taskset -c" should contain at least 5 CPUs; otherwise the time will be
> too long.
> 

I see.

--
Thanks.
HATAYAMA, Daisuke

