UBIFS, Memory fragmentation problem
Artem Bityutskiy
dedekind1 at gmail.com
Thu Apr 29 07:00:40 EDT 2010
CCing fs-devel, in case other people can give good suggestions.
On Mon, 2010-04-26 at 15:39 +0200, Tomasz Stanislawski wrote:
> Dear Mr. Artem Bityutskiy,
> Recently, I was developing a platform that utilizes UBIFS. An
> interesting problem
> was encountered. During booting, UBIFS generates error messages and it sets
> file system into read only mode. Please look to appendix A. I have
> investigated
> problem and according to my analysis observed problems are caused by severe
> memory fragmentation. The platform is based on kernel 2.6.29.
>
> Few kernel errors where found in system logs. Please look to appendix A.
> It looks
> that this failure was caused by memory fragmentation. Look at two
> following lines:
>
> DMA: 186*4kB 2*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB
> 0*2048kB 0*4096kB 0*8192kB = 760kB
> DMA: 998*4kB 146*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB
> 0*2048kB 0*4096kB 0*8192kB = 5160kB
>
> There is no contiguous memory block of size 16 kB. Please consider
> following
> scenario:
>
> Assume that program reached line 916 in file.c (look to Appendix B).
> Now do_writepage function is executed. Now function
> ubifs_jnl_write_data is
> called. Inside function ubifs_jnl_write_data operation kmalloc is called
> (Appendix C, line 697). Driver tries to allocate slightly more than 8 kiB.
> Unfortunately, allocator finds no contiguous memory block long enough. It
> wakes up kswapd daemon. The daemon tries to drop page cache to disk.
> Storing
> data to ubifs partition calls ubifs_jnl_write_data again. Once again it
> tries
> to allocate slightly more than 8 KiB. Allocator detects allocation
> operation
> from procedure called to retain memory. In order to avoid endless loop
> stack
> dump is generated and kmalloc fails. Failure of ubifs_jnl_write_data causes
> failure of kswapd action. Ubifs driver executes code in lines 926-929
> setting
> UBI partition in read-only mode. Since now system becomes unstable, all
> writing
> operation to root file system are denied.
>
> Simple workaround for this problem is changing all kmalloc/kfree to
> vmalloc/vfree. This functions creates virtual memory mapping, so there
> is no
> need to find contiguous memory blocks. Such a patch was attached in the
> file
> 'kmalloc2vmalloc.patch'. Function kmalloc is used often is ubifs driver
> so it is
> possible that the problem might appear somewhere else. Static buffer cannot
> be used because function ubifs_jnl_write_data is both reentrant and in some
> sense 'recursive'.
Reentrant - yes, recursive - no. If kmalloc() cannot find space, it just
returns error, and GFP_NOFS makes sure . There is no recursion. kswapd
is a separate independent process.
> Function ubifs_jnl_write_data calls kmalloc, which calls
> __alloc_pages_internal. This function wakes up kswapd daemon. In order
> to drop
> page cache or buffers it tries to initiate UBIFS operations which include
> calling ubifs_jnl_write_data.
This is indeed a work-around. But I do not want to go for vmalloc,
because lower-level MTD drivers may have difficulties with doing DMA on
vmalloc'ed buffers. Well, there are few places where we use vmalloced
buffers for I/O, and those should be fixed, but I do not want to add
more.
Also, vmalloc has its own issues - the virtual address range for vmalloc
on 32-bit systems is very limited (128MiB or something like that), and
we can start hitting these errors.
But I do not see why 'static buffer' cannot be used. Quite the opposite,
this approach can be used.
You can do something like we do in 'ubifs_bulk_read()':
1. Pre-allocate a 16KiB buffer at mount time (c->write_reserve). Add a
mutex to protect this buffer (c->write_reserve_mutex).
2. in 'ubifs_jnl_write_data()', try to allocate the buffer with
'kmalloc(GFP_NOFS | __GFP_NOWARN)' flag.
3. if it fails, then use 'c->write_reserve' under
'c->write_reserve_mutex'
We could try to use mempools (see mm/mempools.c), but they are designed
for constant size allocations.
> Proposed solution is not sufficient IMHO. A failure of kmalloc
> operations may occur
> sooner or later, causing system crash or malfunction. There are numerous
> calls to kmalloc inside UBIFS code. I wanted to ask you if it makes
> sense to change
> all of them to vmalloc interface. Have you run into such problems with
> memory
> fragmentation?
No, we did not see it. But I do agree we should make UBIFS be more
tolerant to high memory pressure conditions by being ready to not
allocate in the write pat
>
> I hope you find this information useful.
>
> Yours sincerely,
> Tomasz Stanislawski
>
>
> * Appendix A *
>
> <4>[ 1018.882720] <4>kswapd0: page allocation failure. order:2, mode:0x4050
> <4>[ 1018.887611] [<c047fec4>] (dump_stack+0x0/0x14) from [<c01a1b30>]
> (__alloc_pages_internal+0x3a8/0x3d4)
> <4>[ 1018.896906] [<c01a1788>] (__alloc_pages_internal+0x0/0x3d4) from
> [<c01a1bdc>] (__get_free_pages+0x20/0x68)
> <4>[ 1018.906296] [<c01a1bbc>] (__get_free_pages+0x0/0x68) from
> [<c0259b8c>] (ubifs_jnl_write_data+0x30/0x1a4)
> <4>[ 1018.915751] [<c0259b5c>] (ubifs_jnl_write_data+0x0/0x1a4) from
> [<c025b3a4>] (do_writepage+0x9c/0x188)
> <4>[ 1018.924943] [<c025b308>] (do_writepage+0x0/0x188) from
> [<c025b5fc>] (ubifs_writepage+0x16c/0x190)
> <4>[ 1018.933838] r7:c4258000 r6:000009e3 r5:00000000 r4:c0730fa0
> <4>[ 1018.939426] [<c025b490>] (ubifs_writepage+0x0/0x190) from
> [<c01a6f70>] (shrink_page_list+0x3e4/0x7c4)
> <4>[ 1018.948680] [<c01a6b8c>] (shrink_page_list+0x0/0x7c4) from
> [<c01a75ec>] (shrink_list+0x29c/0x5ac)
> <4>[ 1018.957513] [<c01a7350>] (shrink_list+0x0/0x5ac) from [<c01a7b8c>]
> (shrink_zone+0x290/0x344)
> <4>[ 1018.965909] [<c01a78fc>] (shrink_zone+0x0/0x344) from [<c01a817c>]
> (kswapd+0x3c4/0x560)
> <4>[ 1018.973907] [<c01a7db8>] (kswapd+0x0/0x560) from [<c0171318>]
> (kthread+0x54/0x80)
> <4>[ 1018.981504] [<c01712c4>] (kthread+0x0/0x80) from [<c015fdbc>]
> (do_exit+0x0/0x640)
> <4>[ 1018.988826] r5:00000000 r4:00000000
> <4>[ 1018.992331] Mem-info:
> <4>[ 1018.994590] DMA per-cpu:
> <4>[ 1018.997173] CPU 0: hi: 18, btch: 3 usd: 16
> <4>[ 1019.001891] DMA per-cpu:
> <4>[ 1019.004440] CPU 0: hi: 90, btch: 15 usd: 76
> <4>[ 1019.009249] <4>[ 1019.009281] <4>[ 1019.009315] Active_anon:10728
> active_file:3486 inactive_anon:10797
> inactive_file:6444 unevictable:3641 dirty:1 writeback:1 unstable:0
> free:1480 slab:2573 mapped:12344 pagetables:1118 bounce:0
> <4>[ 1019.029203] DMA free:760kB min:548kB low:684kB high:820kB
> active_anon:608kB inactive_anon:688kB active_file:52kB inactive_file:0kB
> unevictable:516kB present:80264kB pages_scanned:2 all_unreclaimable? no
> <4>[ 1019.047162] lowmem_reserve[]: 0 0 0
> <4>[ 1019.050559] DMA free:5160kB min:1780kB low:2224kB high:2668kB
> active_anon:42304kB inactive_anon:42500kB active_file:13892kB
> inactive_file:25776kB unevictable:14048kB present:260096kB
> pages_scanned:0 all_unreclaimable? no
> <4>[ 1019.070274] lowmem_reserve[]: 0 0 0
> <4>[ 1019.073514] DMA: 186*4kB 2*8kB 0*16kB 0*32kB 0*64kB 0*128kB
> 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB = 760kB
> <4>[ 1019.084267] DMA: 998*4kB 146*8kB 0*16kB 0*32kB 0*64kB 0*128kB
> 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB = 5160kB
> <4>[ 1019.095285] 19144 total pagecache pages
> <4>[ 1019.099144] 4569 pages in swap cache
> <4>[ 1019.102719] Swap cache stats: add 66771, delete 62202, find 8700/13664
> <4>[ 1019.109204] Free swap = 13540kB
> <4>[ 1019.112412] Total swap = 99992kB
> <4>[ 1019.133396] 85760 pages of RAM
> <4>[ 1019.135006] 1943 free pages
> <4>[ 1019.137860] 32559 reserved pages
> <4>[ 1019.140990] 2129 slab pages
> <4>[ 1019.143814] 41254 pages shared
> <4>[ 1019.146801] 4569 pages swap cached
> <4>[ 1019.150247] <3>UBIFS error (pid 395): do_writepage: cannot write
> page 1 of inode 13633, error -12
> <4>[ 1019.159087] <4>UBIFS warning (pid 395): ubifs_ro_mode: switched to
> read-only mode, error -12
> <4>[ 1019.306567] <3>UBIFS error (pid 395): make_reservation: cannot
> reserve 529 bytes in jhead 2, error -30
> <4>[ 1019.314216] <3>UBIFS error (pid 395): do_writepage: cannot write
> page 0 of inode 13633, error -30
>
>
> * Appendix B *
>
> File fs/ubifs/file.c:910
> 910 addr = kmap(page);
> 911 block = page->index << UBIFS_BLOCKS_PER_PAGE_SHIFT;
> 912 i = 0;
> 913 while (len) {
> 914 blen = min_t(int, len, UBIFS_BLOCK_SIZE);
> 915 data_key_init(c, &key, inode->i_ino, block);
> 916 err = ubifs_jnl_write_data(c, inode, &key, addr,
> blen);
> 917 if (err)
> 918 break;
> 919 if (++i >= UBIFS_BLOCKS_PER_PAGE)
> 920 break;
> 921 block += 1;
> 922 addr += blen;
> 923 len -= blen;
> 924 }
> 925 if (err) {
> 926 SetPageError(page);
> 927 ubifs_err("cannot write page %lu of inode %lu,
> error %d",
> 928 page->index, inode->i_ino, err);
> 929 ubifs_ro_mode(c, err);
> 930 }
>
> * Appendix C *
>
> File fs/ubifs/journal.c:684
> 684 int ubifs_jnl_write_data(struct ubifs_info *c, const struct inode
> *inode,
> 685 const union ubifs_key *key, const void
> *buf, int len)
> 686 {
> 687 struct ubifs_data_node *data;
> 688 int err, lnum, offs, compr_type, out_len;
> 689 int dlen = UBIFS_DATA_NODE_SZ + UBIFS_BLOCK_SIZE *
> WORST_COMPR_FACTOR;
> 690 struct ubifs_inode *ui = ubifs_inode(inode);
> 691 692 dbg_jnl("ino %lu, blk %u, len %d, key %s",
> 693 (unsigned long)key_inum(c, key), key_block(c,
> key), len,
> 694 DBGKEY(key));
> 695 ubifs_assert(len <= UBIFS_BLOCK_SIZE);
> 696 697 data = kmalloc(dlen, GFP_NOFS);
> 698 if (!data)
> 699 return -ENOMEM;
> 700 701 data->ch.node_type = UBIFS_DATA_NODE;
>
> differences between files attachment (kmalloc2vmalloc.patch)
> From 44a4267d6f9df2a65c7b2a6d45942a2ddd587309 Mon Sep 17 00:00:00 2001
> From: Tomasz Stanislawski <t.stanislaws at samsung.com>
> Date: Wed, 21 Apr 2010 14:59:33 +0200
> Subject: [PATCH] changed kmalloc to vmalloc to fix fragmentation problem
>
> ---
> fs/ubifs/journal.c | 6 +++---
> 1 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ubifs/journal.c b/fs/ubifs/journal.c
> index 64b5f3a..29d3db1 100644
> --- a/fs/ubifs/journal.c
> +++ b/fs/ubifs/journal.c
> @@ -694,7 +694,7 @@ int ubifs_jnl_write_data(struct ubifs_info *c, const struct inode *inode,
> DBGKEY(key));
> ubifs_assert(len <= UBIFS_BLOCK_SIZE);
>
> - data = kmalloc(dlen, GFP_NOFS);
> + data = vmalloc(dlen);
> if (!data)
> return -ENOMEM;
>
> @@ -732,7 +732,7 @@ int ubifs_jnl_write_data(struct ubifs_info *c, const struct inode *inode,
> goto out_ro;
>
> finish_reservation(c);
> - kfree(data);
> + vfree(data);
> return 0;
>
> out_release:
> @@ -741,7 +741,7 @@ out_ro:
> ubifs_ro_mode(c, err);
> finish_reservation(c);
> out_free:
> - kfree(data);
> + vfree(data);
> return err;
> }
>
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/
--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
More information about the linux-mtd
mailing list