Corrupt Empty Space Error at Runtime

Fri Dec 18 13:49:19 PST 2015

On Fri, Dec 18, 2015 at 5:38 PM, Adam <aps337 at gmail.com> wrote:
> Hello All,
>
> I am working on an at91sama5d3x-based system running Linux 3.18.9. I
> have been seeing an issue where, during normal operation, the
> following shows up in the log:
>
>    kern.warn kernel: [<c00cabf4>] (vfs_fsync) from [<c025e2ec>]
> (loop_thread+0x420/0x740)
>    kern.warn kernel: [<c017cb64>] (ubifs_fsync) from [<c00cabf4>]
> (vfs_fsync+0x34/0x44)
>    kern.warn kernel: [<c006b3b8>] (filemap_write_and_wait_range) from
> [<c017cb64>] (ubifs_fsync+0x40/0xb4)
>    kern.warn kernel: [<c006b294>] (__filemap_fdatawrite_range) from
> [<c006b3b8>] (filemap_write_and_wait_range+0x34/0x74)
>    kern.warn kernel: [<c0073150>] (generic_writepages) from
> [<c006b294>] (__filemap_fdatawrite_range+0x4c/0x54)
>    kern.warn kernel: [<c0072f60>] (write_cache_pages) from
> [<c0073150>] (generic_writepages+0x40/0x60)
>    kern.warn kernel: [<c00727b4>] (__writepage) from [<c0072f60>]
> (write_cache_pages+0x1c4/0x374)
>    kern.warn kernel: [<c017c49c>] (do_writepage) from [<c00727b4>]
> (__writepage+0x14/0x5c)
>    kern.warn kernel: [<c017a6ec>] (ubifs_jnl_write_data) from
> [<c017c49c>] (do_writepage+0x94/0x1f4)
>    kern.warn kernel: [<c0179a54>] (make_reservation) from [<c017a6ec>]
> (ubifs_jnl_write_data+0xec/0x274)
>    kern.warn kernel: [<c01918dc>] (ubifs_garbage_collect) from
> [<c0179a54>] (make_reservation+0x108/0x46c)
>    kern.warn kernel: [<c00110b0>] (show_stack) from [<c01918dc>]
> (ubifs_garbage_collect+0x1d4/0x3e0)
>    kern.warn kernel: [<c00133fc>] (unwind_backtrace) from [<c00110b0>]
> (show_stack+0x10/0x14)
>    kern.warn kernel: CPU: 0 PID: 676 Comm: loop0 Not tainted 3.18.9 #1
>    kern.warn kernel: UBIFS warning (pid 676): ubifs_ro_mode: switched
> to read-only mode, error -117
>    kern.err kernel:  UBIFS error (pid 676): ubifs_scan: LEB 846 scanning failed
>    kern.debug kernel: 00001fe0: ffffffff ffffffff ffffffff ffffffff
> ffffffff ffffffff ffffffff ffffffff  ................................
>    kern.debug kernel: 00001fc0: ffffffff ffffffff ffffffff ffffffff
> ffffffff ffffffff ffffffff ffffffff  ................................
>    kern.debug kernel: 00001fa0: ffffffff ffffffff ffffffff ffffffff
> ffffffff ffffffff ffffffff ffffffff  ................................
>    kern.debug kernel: 00001f80: ffffffff ffffffff ffffffff ffffffff
> ffffffff ffffffff ffffffff ffffffff  ................................
>    kern.debug kernel: 00001f60: ffffffff ffffffff ffffffff ffffffff
> ffffffff ffffffff ffffffff ffffffff  ................................
>    kern.debug kernel: 00001f40: ffffffff ffffffff ffffffff ffffffff
> ffffffff ffffffff ffffffff ffffffff  ................................
>    <snip>
>
>
> Looking at the source, it appears that the failure to scan that LEB
> causes the filesystem to be switched to read-only mode. Based on the
> source, it also looks like I am losing a couple of important debug
> error messages due to an issue with our logging infrastructure
> (unfortunately, the serial console was not attached when the failure
> occurred), but I think that we're encountering a 'corrupt empty
> space' condition. Does this seem right?

Could be. But to be sure we need the full logs.

> In doing some research (mostly on archives of this mailing list), I
> believe that LEB 846 is an empty space block and that there has been a
> bit flip in it. Based on previous posts here and looking at atmel_nand
> driver, it looks like the atmel_nand driver (and underlying hardware)
> do not support ECC correction of bit flips in empty blocks and UBIFS
> doesn't currently have a way to deal with this.
>
> I see that some folks reported that they simply hacked the ubifs_scan
> routine to work around this issue, so that it does not report
> corruption when the corrupt block is an empty block. What is the
> disadvantage of doing this? It seems fairly harmless to have errors
> in empty blocks, no?
>
> What are other options? People must have ways of working around this.

UBIFS assumes that reading from empty space works, i.e. that it
returns all 0xff. It relies on this, for example, at mount time to
detect unclean unmounts, e.g. a power cut while erasing or writing.
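
To illustrate that assumption, here is a minimal user-space sketch in
plain C. It is not the actual UBIFS code; the name region_is_erased is
made up for this example. It shows why a single bit flip in an erased
region is enough to make a strict "is this empty?" test fail.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from the kernel): returns 1 if every
 * byte in buf[0..len) reads back as 0xff, i.e. the region looks erased,
 * and 0 as soon as any byte differs.
 */
static int region_is_erased(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0xff)
            return 0; /* even one flipped bit lands here */
    }
    return 1;
}
```

With a check of this strict kind, a region that is all 0xff except for
one byte reading 0xfb is indistinguishable from a region that holds
leftover or half-written data.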

Sadly, some NAND flash controllers' ECC functions do not work on empty
space, i.e. the check bytes computed over all-0xff data are not
themselves all 0xff.

It is still undecided whether this should be addressed in MTD core or within
the individual NAND drivers.
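
Wherever it ends up, such a fix would probably boil down to a
bit-flip-tolerant "is this erased?" test below UBIFS. A rough
user-space sketch in plain C follows. The names check_erased_region and
max_bitflips are assumptions for illustration only, and the threshold
would have to come from the ECC strength of the chip in question.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count the 0 bits in one byte (an erased byte has none). */
static unsigned zero_bits(uint8_t b)
{
    unsigned n = 0;
    uint8_t v = (uint8_t)~b;

    while (v) {
        v &= (uint8_t)(v - 1); /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * Hypothetical helper: treat buf[0..len) as erased if it contains at
 * most max_bitflips zero bits. On success the buffer is rewritten to
 * all 0xff and the number of flipped bits is returned; otherwise the
 * buffer is left untouched and -1 is returned.
 */
static int check_erased_region(uint8_t *buf, size_t len,
                               unsigned max_bitflips)
{
    unsigned flips = 0;

    for (size_t i = 0; i < len; i++) {
        flips += zero_bits(buf[i]);
        if (flips > max_bitflips)
            return -1; /* too many zero bits: assume real data */
    }
    if (flips)
        memset(buf, 0xff, len); /* hand clean empty space upwards */
    return (int)flips;
}
```

The point of doing this below the filesystem is that UBIFS would still
see clean 0xff and its empty-space checks would keep working
unchanged, whereas patching ubifs_scan only hides the symptom in one
place.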

-- 
Thanks,
//richard