[RFC] makedumpfile-1.5.1 RC

Tue Nov 20 08:03:20 EST 2012

On Tue, 2012-11-20 at 16:35 +0000, Vivek Goyal wrote:
> On Tue, Nov 20, 2012 at 05:14:55AM -0700, Lisa Mitchell wrote:
> 
> [..]
> > I tested this makedumpfile v1.5.1-rc on a 4 TB DL980, on 2.6.32 based
> > kernel, and got good results. With crashkernel=256M, and default
> > settings (i.e. no cyclic buffer option selected), the dump successfully
> > completed in about 2 hours, 40 minutes, and then I specified a cyclic
> > buffer size of 48 M, and the dump completed in the same time, no
> > measurable differences within the accuracy of our measurements. 
> 
> This sounds a little odd to me.
> 
> - With smaller buffer size of 48M, it should have taken much more time
>   to finish the dump as compared to when no restriction was put on
>   buffer size. I am assuming that out of 256M reserved, say around 128MB
>   was available for makedumpfile to use.
> 
> - Also 2 hours 40 minutes sounds a lot. Is it practical to wait that
>   long for a machine to dump before it can be brought into service
>   again? Do you have any data w.r.t older makedumpfile (which did not
>   have cyclic buffer logic).
> 
> I have some data which I collected in 2008. A 128GB system took roughly
> 4 minutes to filter and save the dumpfile. So if we scale it linearly
> then it should take around 32 minutes per TB. Hence around 2 hours
> 8 minutes for a 4TB system. Your numbers do seem to be roughly in
> line.
> 
> Still 2-2.5 hours seems too long to filter and save the core of a
> 4TB system. We will probably need to figure out what's taking so much
> time. Maybe we need to look into Cliff Wickman's idea of the kernel
> returning a list of pfns to dump and make the dump 20 times faster. I
> would love to have a 4TB system dumped in 6 minutes as opposed to
> 2 hrs. :-)
> 
> Thanks
> Vivek
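
For reference, the linear extrapolation quoted above can be checked with
a few lines of Python. This is only a sketch of the arithmetic: the
function name and defaults are made up for illustration, and it assumes
1 TB = 1024 GB and that dump time scales linearly with memory size, which
nobody has verified here.

```python
def estimate_dump_minutes(memory_gb, ref_gb=128.0, ref_minutes=4.0):
    """Scale a reference dump time linearly with memory size.

    ref_gb / ref_minutes default to the 2008 data point quoted above
    (128 GB filtered and saved in roughly 4 minutes).
    """
    return memory_gb / ref_gb * ref_minutes


if __name__ == "__main__":
    per_tb = estimate_dump_minutes(1024)       # 32.0 minutes per TB
    four_tb = estimate_dump_minutes(4 * 1024)  # 128.0 minutes, i.e. 2 h 8 min
    hours, minutes = divmod(int(four_tb), 60)
    print(f"per TB: {per_tb:.0f} min, 4 TB: {hours} h {minutes} min")
    # A hypothetical 20x speedup would bring 4 TB down to about 6.4 minutes.
    print(f"4 TB at 20x faster: {four_tb / 20:.1f} min")
```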

As I said, I don't have precise performance data yet, but the time I
measured is comparable to the roughly 3-4 hours a successful dump took
on this same system with makedumpfile v1.4 and a larger crashkernel
size.  We haven't made a proper apples-to-apples comparison between the
two versions, but that is how long this 4 TB system has been taking to
dump at dump level 31, so we feel makedumpfile v1.5.1 is in the same
ballpark.
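
For anyone unfamiliar with the dump level value, it is a bitmask of page
types for makedumpfile to exclude, and 31 sets all five bits. A small
sketch of the decoding follows; the bit meanings are taken from the
makedumpfile man page, while the helper name and the short labels are
only illustrative.

```python
# Page types excluded by each dump level bit (per the makedumpfile man page).
DUMP_LEVEL_BITS = {
    1: "zero pages",
    2: "cache pages (non-private)",
    4: "cache pages (private)",
    8: "user data pages",
    16: "free pages",
}


def excluded_page_types(dump_level):
    """Return the page types a given dump level asks makedumpfile to skip."""
    return [name for bit, name in DUMP_LEVEL_BITS.items() if dump_level & bit]


if __name__ == "__main__":
    # Level 31 = 1 + 2 + 4 + 8 + 16, so every listed page type is excluded.
    for name in excluded_page_types(31):
        print(name)
```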

It does seem that the "Excluding pages" parts take up a lot of the time
in the dump, as opposed to the copying, but I don't have a good
breakdown.  

I added debug_mem_level 3 to the kdump.conf file.  The used memory
recorded on this machine right before makedumpfile creates the bitmap
and starts filtering is around 140 MB.  With makedumpfile v1.4 and a
crashkernel size of 256 MB or 384 MB, I have seen makedumpfile fail
after this point, with the OOM killer active.

So makedumpfile v1.5.1 solves the above problem, and allows us to
successfully dump a 4 TB system with these smaller crashkernel sizes.
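
A back-of-the-envelope sketch of why a non-cyclic bitmap is a tight fit
for these crashkernel sizes. The figures below are assumptions for
illustration, not measurements from this system: 4 KiB pages, one bit
per page frame, and two bitmaps each covering all of memory.

```python
def bitmap_bytes(memory_bytes, page_size=4096, bitmaps=2):
    """Bytes needed for `bitmaps` bitmaps with one bit per page frame.

    page_size and bitmaps are assumed values, not read from any system.
    """
    pages = memory_bytes // page_size
    bytes_per_bitmap = (pages + 7) // 8  # round up to whole bytes
    return bitmaps * bytes_per_bitmap


if __name__ == "__main__":
    four_tib = 4 * 2**40
    total_mib = bitmap_bytes(four_tib) / 2**20
    # 2**30 pages -> 128 MiB per bitmap -> 256 MiB for two bitmaps
    print(f"bitmaps for 4 TiB: {total_mib:.0f} MiB")
```

Under those assumptions the bitmaps alone would need about as much
memory as a 256 MB crashkernel reservation provides, before counting the
roughly 140 MB already in use, which would be consistent with the OOM
failures seen with v1.4.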

We do need much better performance numbers to ensure there is no
regression from makedumpfile v1.4, but I wanted you to at least have
feedback on the testing we have done so far.  It appears to solve the
primary problem we were interested in: dumping many terabytes of memory
with crashkernel sizes fixed at 384 MB or below.