Recent 3.x kernels: Memory leak causing OOMs

Russell King - ARM Linux linux at arm.linux.org.uk
Sat Mar 15 06:19:52 EDT 2014


On Mon, Feb 17, 2014 at 09:09:54PM +0000, Russell King - ARM Linux wrote:
> On Mon, Feb 17, 2014 at 10:02:31PM +0100, Maxime Bizon wrote:
> > 
> > On Sun, 2014-02-16 at 22:50 +0000, Russell King - ARM Linux wrote:
> > 
> > > http://www.home.arm.linux.org.uk/~rmk/misc/log-20140208.txt
> > 
> > [<c0064ce0>] (__alloc_pages_nodemask+0x0/0x694) from [<c022273c>] (sk_page_frag_refill+0x78/0x108)
> > [<c02226c4>] (sk_page_frag_refill+0x0/0x108) from [<c026a3a4>] (tcp_sendmsg+0x654/0xd1c)  r6:00000520 r5:c277bae0 r4:c68f37c0
> > [<c0269d50>] (tcp_sendmsg+0x0/0xd1c) from [<c028ca9c>] (inet_sendmsg+0x64/0x70)
> > 
> > FWIW I had OOMs with the exact same backtrace on kirkwood platform
> > (512MB RAM), but sorry I don't have the full dump anymore.
> > 
> > I found a slow leaking process, and since I fixed that leak I now have
> > uptime better than 7 days, *but* there was definitely some memory left
> > when the OOM happened, so it appears to be related to fragmentation.
> 
> However, that's a side effect, not the cause - and a patch has been
> merged to fix that OOM - but that doesn't explain where most of the
> memory has gone!
> 
> I'm presently waiting for the machine to OOM again (it's probably going
> to be something like another month) at which point I'll grab the files
> people have been mentioning (/proc/meminfo, /proc/vmallocinfo,
> /proc/slabinfo etc.)

For those new to this report, this is a 3.12.6+ kernel, and I'm seeing
OOMs after a month or two of uptime.

Last night, it OOM'd severely again at around 5am... and rebooted soon
after, so we've lost any hope of recovering anything useful from the
machine.

However, the new kernel re-ran the raid check, and...

md: data-check of RAID array md2
md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec)
for data-check.
md: using 128k window, over a total of 4194688k.
md: delaying data-check of md3 until md2 has finished (they share one or more physical units)
md: delaying data-check of md4 until md2 has finished (they share one or more physical units)
md: delaying data-check of md3 until md2 has finished (they share one or more physical units)
md: delaying data-check of md5 until md2 has finished (they share one or more physical units)
md: delaying data-check of md3 until md2 has finished (they share one or more physical units)
md: delaying data-check of md4 until md2 has finished (they share one or more physical units)
md: delaying data-check of md6 until md2 has finished (they share one or more physical units)
md: delaying data-check of md4 until md2 has finished (they share one or more physical units)
md: delaying data-check of md3 until md2 has finished (they share one or more physical units)
md: delaying data-check of md5 until md2 has finished (they share one or more physical units)
md: md2: data-check done.
md: delaying data-check of md5 until md3 has finished (they share one or more physical units)
md: data-check of RAID array md3
md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec)
for data-check.
md: using 128k window, over a total of 524544k.
md: delaying data-check of md4 until md3 has finished (they share one or more physical units)
md: delaying data-check of md6 until md3 has finished (they share one or more physical units)
kmemleak: 836 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
md: md3: data-check done.
md: delaying data-check of md6 until md4 has finished (they share one or more physical units)
md: delaying data-check of md4 until md5 has finished (they share one or more physical units)
md: data-check of RAID array md5
md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec)
for data-check.
md: using 128k window, over a total of 10486080k.
kmemleak: 2235 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
md: md5: data-check done.
md: data-check of RAID array md4
md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec)
for data-check.
md: using 128k window, over a total of 10486080k.
md: delaying data-check of md6 until md4 has finished (they share one or more physical units)
kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
md: md4: data-check done.
md: data-check of RAID array md6
md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec)
for data-check.
md: using 128k window, over a total of 10409472k.
kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
kmemleak: 3 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
md: md6: data-check done.
kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

which totals 3077 suspected leaks.  So we have a memory leak.  Looking at
the kmemleak file:

unreferenced object 0xc3c3f880 (size 256):
  comm "md2_resync", pid 4680, jiffies 638245 (age 8615.570s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 f0  ................
    00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<c008d4f0>] __save_stack_trace+0x34/0x40
    [<c008d5f0>] create_object+0xf4/0x214
    [<c02da114>] kmemleak_alloc+0x3c/0x6c
    [<c008c0d4>] __kmalloc+0xd0/0x124
    [<c00bb124>] bio_alloc_bioset+0x4c/0x1a4
    [<c021206c>] r1buf_pool_alloc+0x40/0x148
    [<c0061160>] mempool_alloc+0x54/0xfc
    [<c0211938>] sync_request+0x168/0x85c
    [<c021addc>] md_do_sync+0x75c/0xbc0
    [<c021b594>] md_thread+0x138/0x154
    [<c0037b48>] kthread+0xb0/0xbc
    [<c0013190>] ret_from_fork+0x14/0x24
    [<ffffffff>] 0xffffffff

There are 3077 entries like this in the debug file: 3075 are for
"md2_resync" and two are for "md4_resync".
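
For anyone who wants to reproduce that per-comm breakdown, a minimal
userspace sketch along the following lines will tally the entries; it
assumes the standard kmemleak report format, and the table size and output
layout are arbitrary:

/*
 * Tally kmemleak records by their "comm" field.  Trigger a fresh scan
 * first with "echo scan > /sys/kernel/debug/kmemleak" if needed.
 * Build: gcc -O2 -o leakcount leakcount.c
 */
#include <stdio.h>
#include <string.h>

struct count {
        char comm[64];
        unsigned long n;
};

int main(void)
{
        static struct count tab[256];
        size_t ntab = 0;
        char line[512];
        FILE *f = fopen("/sys/kernel/debug/kmemleak", "r");

        if (!f) {
                perror("kmemleak");
                return 1;
        }

        while (fgets(line, sizeof(line), f)) {
                char comm[64];
                size_t i;

                /* each record has a line like: comm "md2_resync", pid 4680, ... */
                if (sscanf(line, " comm \"%63[^\"]\"", comm) != 1)
                        continue;

                for (i = 0; i < ntab; i++)
                        if (strcmp(tab[i].comm, comm) == 0)
                                break;
                if (i == ntab && ntab < 256)
                        strcpy(tab[ntab++].comm, comm);
                if (i < ntab)
                        tab[i].n++;
        }
        fclose(f);

        for (size_t i = 0; i < ntab; i++)
                printf("%8lu  %s\n", tab[i].n, tab[i].comm);

        return 0;
}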

/proc/slabinfo shows for this bucket:
kmalloc-256         3237   3450    256   15    1 : tunables  120   60    0 : slabdata    230    230      0

but at 3237 objects * 256 bytes, this would only account for about 800kB of
memory usage, which is itself insignificant - so this is not the whole story.

It seems that this loop in r1buf_pool_alloc() is the culprit for the
allocations:
        for (j = pi->raid_disks ; j-- ; ) {
                bio = bio_kmalloc(gfp_flags, RESYNC_PAGES);

Since RESYNC_PAGES will be 64K/4K = 16, and each struct bio_vec is 12 bytes,
that is 12 * 16 = 192 bytes of inline bio_vecs plus the size of struct bio
per allocation, which falls into this bucket.
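
As a back-of-the-envelope check, a trivial userspace sketch like the one
below reproduces that arithmetic and the resulting kmalloc bucket.  The
struct sizes in it are assumptions for a 32-bit ARM build of this kernel,
not values read from the running system; the single-kmalloc behaviour is
what the __kmalloc frame in the backtrace above shows.

/*
 * Rough size check for the r1buf_pool_alloc() bios: bio_kmalloc() ends up
 * doing one kmalloc of struct bio plus the inline bio_vecs.  The struct
 * sizes below are assumptions for a 32-bit ARM build, not values read
 * from the running kernel.
 */
#include <stdio.h>

#define RESYNC_BLOCK_SIZE       (64 * 1024)
#define PAGE_SIZE               4096
#define RESYNC_PAGES            (RESYNC_BLOCK_SIZE / PAGE_SIZE)    /* 16 */

int main(void)
{
        unsigned int sizeof_bio_vec = 12;   /* page pointer + len + offset */
        unsigned int sizeof_bio = 60;       /* rough guess for this kernel */
        unsigned int alloc = sizeof_bio + RESYNC_PAGES * sizeof_bio_vec;

        /* common kmalloc cache sizes */
        static const unsigned int caches[] = { 64, 96, 128, 192, 256, 512, 1024 };
        unsigned int i = 0;

        while (alloc > caches[i] && i + 1 < sizeof(caches) / sizeof(caches[0]))
                i++;

        printf("RESYNC_PAGES       = %d\n", RESYNC_PAGES);
        printf("per-bio allocation = %u bytes -> kmalloc-%u\n", alloc, caches[i]);
        return 0;
}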

I don't see anything obvious - it looks like it isn't every raid check
which loses bios.  Not quite sure what to make of this right now.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.


