UBIFS does not mount after powerfail

Sun Nov 12 11:26:38 PST 2017

Hello together,

in a power fail stress test with ubifs, I ran into 5 images that do not 
mount:

3x ubifs_check_node() fails, called by ubifs_garbage_collect_leb()
2x UBIFS assert failed in ubifs_rcvry_gc_commit at 1219 (pid 9570)
(and ~550 images with just asserts, the mount was successful)

The test was done using kernel 4.14.0-rc5+ #105 SMP PREEMPT.
The test involves create/rename/remove, using ecryptfs with xattr.

Are these known issues?
Are there any patches that I should apply?
What would you need to analyze them?

If you need the image (~32 MB), I'll try to find a suitable server.
Below is the dmesg output from mounting the first bad image.

For me, it appears as if ubifs_check_node() tries to access a node that 
is only partially written:

(from image hexdump):

> 00D61780   6C 00 00 00  09 00 00 00  04 00 00 00  E8 00 00 00  20 7B 
> 00 00  A0 00 00 00  67 9A 00 00  00 00 00 00  l............... 
> {......g.......
> 00D617A0   0C 00 00 00  10 AB 01 00  46 00 00 00  67 9A 00 00 0E 8D D5 
> 7B  E8 00 00 00  C0 7B 00 00  A0 00 00 00 ........F...g......{.....{......
> 00D617C0   68 9A 00 00  00 00 00 00  0C 00 00 00  00 B0 01 00 46 00 00 
> 00  68 9A 00 00  0E 8D D5 7B  00 00 00 00 h...............F...h......{....
> 00D617E0   31 18 10 06  FB 80 A3 3C  2A 1F 0C 00  00 00 00 00 6C 00 00 
> 00  09 00 00 00  04 00 01 00  55 00 00 00 1......<*.......l...........U...
> 00D61800   FF FF FF FF  FF FF FF FF  FF FF FF FF  FF FF FF FF FF FF FF 
> FF  FF FF FF FF  FF FF FF FF  FF FF FF FF ................................
> 00D61820   FF FF FF FF  FF FF FF FF  FF FF FF FF  FF FF FF FF FF FF FF 
> FF  FF FF FF FF  FF FF FF FF  FF FF FF FF ................................
The next write operation would have written this to 0xD16800:

> 00D617E0   31 18 10 06  FB 80 A3 3C  2A 1F 0C 00  00 00 00 00  6C 00 
> 00 00  09 00 00 00  04 00 01 00  55 00 00 00  
> 1......<*.......l...........U...
> 00D61800   20 06 00 00  6C 00 00 00  5F 9A 00 00  00 00 00 00 55 00 00 
> 00  90 06 00 00  6C 00 00 00  61 9A 00 00 ...l..._.......U.......l...a...
> 00D61820   00 00 00 00  55 00 00 00  00 07 00 00  6C 00 00 00 64 9A 00 
> 00  00 00 00 00  55 00 00 00  70 07 00 00 ....U.......l...d.......U...p...
> 00D61840   6C 00 00 00  67 9A 00 00  00 00 00 00  90 99 00 00 31 18 10 
> 06  A2 02 94 35  2B 1F 0C 00  00 00 00 00 l...g...........1......5+.......
> 00D61860   6C 00 00 00  09 00 00 00  04 00 02 00  55 00 00 00 50 01 00 
> 00  6C 00 00 00  3F 9A 00 00  00 00 00 00 l...........U...P...l...?.......
> 00D61880   55 00 00 00  80 03 00 00  6C 00 00 00  49 9A 00 00 00 00 00 
> 00  55 00 00 00  B0 05 00 00  6C 00 00 00 U.......l...I.......U.......l...
> 00D618A0   54 9A 00 00  00 00 00 00  55 00 00 00  E0 07 00 00 6C 00 00 
> 00  5F 9A 00 00  00 00 00 00  01 01 00 00 T.......U.......l..._...........
> 00D618C0   31 18 10 06  2C 7B AB 32  2C 1F 0C 00  00 00 00 00 6C 00 00 
> 00  09 00 00 00  04 00 00 00  0C 00 00 00 1...,{.2,.......l...............

Note:

The test was done with a modified mtdram device that creates a log of 
all write and erase operations, so the issues are not caused by HW issues.
The log contains checksum for everything, so the tool can't be the 
reason for the mount failure.
(and after 5 images, mount is again successful.)

168166: FUNC_WRITE_CHK(addr=0xd60800, len=0x800, chk=0x3335956be2). <<< 
last ok image
168167: FUNC_WRITE_CHK(addr=0xd61000, len=0x800, chk=0x34dad52f0e). <<< 
garbage_collect_leb
168168: FUNC_WRITE_CHK(addr=0xd61800, len=0x800, chk=0x350a12238d). <<< -"-
168169: FUNC_WRITE_CHK(addr=0xd62000, len=0x800, chk=0x3573a0893b). <<< -"-
168170: FUNC_WRITE_CHK(addr=0xd62800, len=0x800, chk=0x356a0366c2). <<< 
assert line 1219
168171: FUNC_WRITE_CHK(addr=0x1b34000, len=0x800, chk=0x3ca1e19062). <<< 
-"-
168172: FUNC_WRITE_CHK(addr=0xc35800, len=0x800, chk=0x334dc94b1f). <<< 
first ok image.

--

     Manfred

<<<<<<<<<<

[61821.708507] ubi0: attaching mtd0
[61821.710142] ubi0: scanning is finished
[61821.712212] ubi0: attached mtd0 (name "mtdram test device", size 32 MiB)
[61821.713285] ubi0: PEB size: 131072 bytes (128 KiB), LEB size: 126976 
bytes
[61821.714336] ubi0: min./max. I/O unit sizes: 2048/2048, sub-page size 2048
[61821.715394] ubi0: VID header offset: 2048 (aligned 2048), data 
offset: 4096
[61821.716425] ubi0: good PEBs: 256, bad PEBs: 0, corrupted PEBs: 0
[61821.717447] ubi0: user volume: 1, internal volumes: 1, max. volumes 
count: 128
[61821.718481] ubi0: max/mean erase counter: 85/32, WL threshold: 4096, 
image sequence number: 18852820
[61821.719566] ubi0: available PEBs: 4, total reserved PEBs: 252, PEBs 
reserved for bad PEB handling: 0
[61821.720694] ubi0: background thread "ubi_bgt0d" started, PID 9421
[61821.725737] UBIFS (ubi0:0): background thread "ubifs_bgt0_0" started, 
PID 9428
[61821.725913] UBIFS (ubi0:0): recovery needed
[61821.743777] UBIFS error (ubi0:0 pid 9426): ubifs_check_node: bad CRC: 
calculated 0xbe6fe258, read 0x3ca380fb
[61821.744917] UBIFS error (ubi0:0 pid 9426): ubifs_check_node: bad node 
at LEB 85:2016
[61821.746174]  magic          0x6101831
[61821.747274]  crc            0x3ca380fb
[61821.748334]  node_type      9 (indexing node)
[61821.749379]  group_type     0 (no node group)
[61821.750404]  sqnum          794410
[61821.751401]  len            108
[61821.752379]  child_cnt      4
[61821.753341]  level          1
[61821.754288]  Branches:
[61821.755212]  0: LEB 85:-1 len -1 key (bad key type: 0xffffffff, 
0xffffffff)
[61821.756166]  1: LEB -1:-1 len -1 key (bad key type: 0xffffffff, 
0xffffffff)
[61821.757091]  2: LEB -1:-1 len -1 key (bad key type: 0xffffffff, 
0xffffffff)
[61821.758027]  3: LEB -1:-1 len -1 key (bad key type: 0xffffffff, 
0xffffffff)
[61821.758939] CPU: 1 PID: 9426 Comm: mount Not tainted 4.14.0-rc5+ #105
[61821.759848] Hardware name: Hewlett-Packard HP 350 G2/803A, BIOS F.13 
06/10/2015
[61821.760777] Call Trace:
[61821.761690]  ? dump_stack+0x67/0x8e
[61821.762610]  ? ubifs_check_node+0x1b1/0x1be
[61821.763539]  ? ubifs_scan_a_node+0x153/0x223
[61821.764466]  ? ubifs_scan+0xab/0x23a
[61821.765387]  ? ubifs_garbage_collect_leb+0x1dd/0x735
[61821.766323]  ? ubifs_rcvry_gc_commit+0x1bc/0x27b
[61821.767254]  ? ubifs_mount+0x1050/0x1ad4
[61821.768177]  ? sb_set+0x13/0x13
[61821.769086]  ? mount_fs+0x6a/0x109
[61821.769975]  ? vfs_kern_mount+0x6b/0x151
[61821.770851]  ? do_mount+0x914/0xb60
[61821.771701]  ? memdup_user+0x3e/0x5a
[61821.772523]  ? SyS_mount+0x72/0x98
[61821.773317]  ? entry_SYSCALL_64_fastpath+0x1c/0xb1
[61821.774138] UBIFS error (ubi0:0 pid 9426): ubifs_scan: bad node
[61821.774974] UBIFS error (ubi0:0 pid 9426): ubifs_scanned_corruption: 
corruption at LEB 85:2016
[61821.775820] UBIFS error (ubi0:0 pid 9426): ubifs_scanned_corruption: 
first 8192 bytes from LEB 85:2016
[61821.776679] 00000000: 06101831 3ca380fb 000c1f2a 00000000 0000006c 
00000009 00010004 00000055 1......<*.......l...........U...
[61821.776683] 00000020: ffffffff ffffffff ffffffff ffffffff ffffffff 
ffffffff ffffffff ffffffff ................................
[61821.776687] 00000040: ffffffff ffffffff ffffffff ffffffff ffffffff 
ffffffff ffffffff ffffffff ................................
[... all 0xff 0xff]
[61821.777645] UBIFS error (ubi0:0 pid 9426): ubifs_scan: LEB 85 
scanning failed
[61821.778501] UBIFS error (ubi0:0 pid 9426): ubifs_rcvry_gc_commit: GC 
failed, error -117
[61821.779948] UBIFS (ubi0:0): background thread "ubifs_bgt0_0" stops
<<<<<<<<<<