UBIFS errors are randomly seen after reboots

chaitanya vinnakota chaitanya.sai.v at gmail.com
Fri Feb 3 01:30:00 PST 2017


Hi Richard,
I've taken the UBIFS changes backported to the 3.2 kernel from
git://git.infradead.org/users/dedekind/ubifs-v3.2.git. With these
changes, the root-filesystem mount failures occur less often, but they
are still seen sometimes. One thing I noticed is that the mount
failures follow reboots in which errors were reported while unmounting
the root filesystem. I modified the UBIFS error debug print to include
the process name along with the pid.

Below is an excerpt of the errors seen during reboots:

UBIFS error (pid 9515, process umount): dbg_check_space_info: free
space changed from 14128638 to 14743030
UBIFS assert failed in reserve_space at 125 (pid 9529)
UBIFS assert failed in ubifs_write_begin at 436 (pid 9529)
[  234.725204] Backtrace:
[  234.725217] [<c40113a0>] (dump_backtrace+0x0/0x110) from
[<c44120e8>] (dump_stack+0x18/0x1c)
[  234.725224]  r6:e5470120 r5:e5470060 r4:002bb000 r3:00000000
[  234.725239] [<c44120d0>] (dump_stack+0x0/0x1c) from [<c417d024>]
(ubifs_write_begin+0x9c/0x498)
[  234.725251] [<c417cf88>] (ubifs_write_begin+0x0/0x498) from
[<c40a5bc4>] (generic_file_buffered_write+0xe0/0x234)
[  234.725264] [<c40a5ae4>] (generic_file_buffered_write+0x0/0x234)
from [<c40a7794>] (__generic_file_aio_write+0x3fc/0x440)
[  234.725276] [<c40a7398>] (__generic_file_aio_write+0x0/0x440) from
[<c40a7844>] (generic_file_aio_write+0x6c/0xd0)
[  234.725288] [<c40a77d8>] (generic_file_aio_write+0x0/0xd0) from
[<c417c7a8>] (ubifs_aio_write+0x16c/0x180)
[  234.725296]  r8:e653b000 r7:e4cc9f78 r6:e4cc9ea8 r5:e5470060 r4:e609f800
[  234.725313] [<c417c63c>] (ubifs_aio_write+0x0/0x180) from
[<c40d72ac>] (do_sync_write+0xa0/0xe0)
[  234.725325] [<c40d720c>] (do_sync_write+0x0/0xe0) from [<c40d7c00>]
(vfs_write+0xbc/0x148)
[  234.725331]  r5:00300000 r4:e609f800
[  234.725342] [<c40d7b44>] (vfs_write+0x0/0x148) from [<c40d7e8c>]
(sys_write+0x48/0x74)
[  234.725349]  r8:c400dd44 r7:00000004 r6:00300000 r5:4039b008 r4:e609f800
[  234.725366] [<c40d7e44>] (sys_write+0x0/0x74) from [<c400dbc0>]
(ret_fast_syscall+0x0/0x30)
[  234.725372]  r6:4039b008 r5:00000001 r4:0008426c

Subsequently, during boot, the root filesystem fails to mount; below
is an excerpt from the logs:

[   10.090852] UBIFS error (pid 1, process swapper/0):
ubifs_check_node: bad CRC: calculated 0xb4b7338e, read 0xe5385648
[   10.101515] UBIFS error (pid 1, process swapper/0):
ubifs_check_node: bad node at LEB 499:98208
[   10.372656] UBIFS error (pid 1, process swapper/0): ubifs_scan: bad node
[   10.379386] UBIFS error (pid 1, process swapper/0):
ubifs_scanned_corruption: corruption at LEB 499:98208
[   10.388988] UBIFS error (pid 1, process swapper/0):
ubifs_scanned_corruption: first 8192 bytes from LEB 499:98208
[   10.403113] UBIFS error (pid 1, process swapper/0): ubifs_scan: LEB
499 scanning failed
[   10.411213] UBIFS: background thread "ubifs_bgt0_0" stops
[   10.467214] VFS: Cannot open root device "ubi0:rootfs" or unknown-block(0,0)
[   10.474287] Please append a correct "root=" boot option; here are
the available partitions:
[   10.482683] 1f00             512 mtdblock0  (driver?)
[   10.487774] 1f01             512 mtdblock1  (driver?)
[   10.492855] 1f02             128 mtdblock2  (driver?)
[   10.497942] 1f03            8192 mtdblock3  (driver?)
[   10.503022] 1f04           94208 mtdblock4  (driver?)
[   10.508110] 1f05             128 mtdblock5  (driver?)
[   10.513190] 1f06            8192 mtdblock6  (driver?)
[   10.518280] 1f07           94208 mtdblock7  (driver?)
[   10.523360] 1f08             128 mtdblock8  (driver?)
[   10.528449] 1f09            2048 mtdblock9  (driver?)
[   10.533529] 1f0a           12288 mtdblock10  (driver?)
[   10.538702] 1f0b           32768 mtdblock11  (driver?)
[   10.543868] 1f0c            2048 mtdblock12  (driver?)
[   10.549048] 1f0d             128 mtdblock13  (driver?)
[   10.554215] 1f0e             512 mtdblock14  (driver?)
[   10.559391] 1f0f             128 mtdblock15  (driver?)
[   10.564559] 1f10             128 mtdblock16  (driver?)
[   10.569735] 1f11              64 mtdblock17  (driver?)
[   10.574901] 1f12              64 mtdblock18  (driver?)
[   10.580074] 1f13              64 mtdblock19  (driver?)
[   10.585240] 0800         3915776 sda  driver: sd
[   10.589895]   0801         3914752 sda1 00000000-0000-0000-0000-000000000000
[   10.596979] 1f14           84320 mtdblock20  (driver?)
[   10.602155] Kernel panic - not syncing: VFS: Unable to mount root
fs on unknown-block(0,0)

Can the correlation between a reboot that hits unmount errors and the
subsequent failed boot give us any insight?

Thanks
Chaitanya




On Thu, Jan 26, 2017 at 2:12 PM, Richard Weinberger <richard at nod.at> wrote:
> Chaitanya,
>
> Am 23.01.2017 um 11:48 schrieb chaitanya vinnakota:
>> Hi Richard,
>>
>> We are seeing UBIFS errors even when the root filesystem is mounted
>> read-only, but the error is reported only once.
>> Our test scenario is that we reboot the device by calling "reboot"
>> from a script while another script performs data-write operations on
>> the MTD partitions other than the root filesystem.
>>
>> What's more baffling is why root-filesystem UBIFS errors are seen
>> when the data-write operations are performed on the other
>> partitions, and most importantly when rootfs is mounted read-only:
>>
>> [  155.121005] UBIFS error (pid 5040): ubifs_decompress: cannot
>> decompress 2434 bytes, compressor zlib, error -22
>
> The compressor fails to decompress because the payload is corrupted.
> This can be due to a driver bug, insufficient ECC strength, etc.
>
>> [  155.121017] UBIFS error (pid 5040): read_block: bad data node
>> (block 60, inode 3484)
>> [  155.121026] UBIFS error (pid 5040): do_readpage: cannot read page
>> 60 of inode 3484, error -22
>>
>> ECC errors are also observed sometimes when the rootfs is mounted read-only
>>
>> [  154.824361] ECC: uncorrectable error 2 !!!
>> [  154.824368] ECC correction failed for page 0x00014b58
>> [  154.825474] ECC: uncorrectable error 2 !!!
>> [  154.825479] ECC correction failed for page 0x00014b58
>> [  154.825604] UBI warning: ubi_io_read: error -74 (ECC error) while
>
> Here we have a classic ECC error. This should not happen on fresh
> NANDs.
>
>> reading 188 bytes from PEB 451:50368, read only 188 bytes, retry
>>
>> The page 0x00014b58 falls in the rootfs partition, but the nanddump
>> utility is not reporting any bad blocks from that partition.
>>
>>  ~# nanddump /dev/mtd4
>> nanddump: warning!: you did not specify a default bad-block handling
>>   method. In future versions, the default will change to
>>   --bb=skipbad. Use "nanddump --help" for more information.
>> nanddump: warning!: in next release, nanddump will not dump OOB
>>   by default. Use `nanddump --oob' explicitly to ensure
>>   it is dumped.
>> ECC failed: 0
>> ECC corrected: 0
>> Number of bad blocks: 0
>> Number of bbt blocks: 0
>>
>>  We ran the MTD and UBI tests; all MTD tests passed, but one UBI
>> test, io_basic, failed.
>>
>> Can you please help us in this regard. Any inputs or suggestions ?
>
> This is not easy. I suggest double-checking all NAND/MTD settings
> from the ground up: timings, ECC strength, basic driver testing...
>
> Thanks,
> //richard


