UBIFS corruption during power failure

Eric Holmberg Eric_Holmberg at Trimble.com
Fri Jul 24 09:39:05 EDT 2009


> More testing of NOR flash against power cuts showed that sometimes
> eraseblocks may be unwritable, and we cannot really invalidate
> them before erasure. But in this case the eraseblock probably
> contains garbage anyway, and we do not have to invalidate the
> headers. This assumption might not be true, but it is at least
> what I have observed. So if we cannot invalidate the headers,
> we make sure that the PEB does not contain a valid VID header.
> If this is true, everything is fine; otherwise we panic.

I haven't seen this case with my flash chips, and it seems like an odd
thing to happen, but it may be possible with other NOR chips.

I will apply your patches and run some more tests over the weekend.

Power Cycling Results
---------------------
I just finished a cycle test using two boards running 2.6.27 with the
latest patches. Both boards were still running after a combined
power-cycle count of 11,178 while running a memory torture test, so the
robustness has improved dramatically.

Random LEB 1 Corruption
-----------------------
Out of the 11,178 power cycles, I saw 12 random corruptions of LEB 1.
The file system recovers the block, and the offset into the LEB where
the error occurs varies. This looks like it may be the remaining half
of the problem I used to have, where I would get two copies of LEB 1
and one would be corrupt. Artem's fix of invalidating the header may
have solved the fatal issue, but I would be interested to know what is
causing this.

The kernel mounts the UBIFS file system read-only initially and then
remounts it for write access later in the boot process. The log below
is from the initial scan in read-only mode (a sketch of the remount
step follows the log).

[42949375.180000] UBIFS error (pid 1): ubifs_check_node: bad CRC: calculated 0xb6a46e9c, read 0xd46336ef
[42949375.190000] UBIFS error (pid 1): ubifs_check_node: bad node at LEB 1:72704
[42949375.200000] UBIFS error (pid 1): ubifs_scanned_corruption: corruption at LEB 1:72704
[42949375.250000] UBIFS error (pid 1): ubifs_scan: LEB 1 scanning failed
[42949375.310000] UBIFS: recovery needed
[42949375.430000] UBIFS: recovery deferred
[42949375.440000] UBIFS: mounted UBI device 0, volume 0, name "rootfs"
[42949375.440000] UBIFS: mounted read-only
[42949375.450000] UBIFS: file system size:   27498240 bytes (26853 KiB, 26 MiB, 210 LEBs)
[42949375.450000] UBIFS: journal size:       1178497 bytes (1150 KiB, 1 MiB, 9 LEBs)
[42949375.460000] UBIFS: media format:       w4/r0 (latest is w4/r0)
[42949375.470000] UBIFS: default compressor: lzo
[42949375.470000] UBIFS: reserved for root:  0 bytes (0 KiB)
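
For reference, the later switch to write access is just a standard
remount from the init scripts; roughly (assuming the UBIFS root is
mounted on /):

# Remount the root file system read-write once boot has progressed
# far enough; UBIFS completes the deferred recovery at this point
mount -o remount,rw /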

Truncated Files
---------------
I keep a boot count in two separate files on the file system and update
them on every boot. The count is getting reset after it reaches 100 or
so. Here is a stripped-down version of the boot count script. Do you
see anything obvious? I will do more analysis soon to determine the
exact cause, but I wanted to get everybody's opinion first.

# set count = max(/var/boot_count_a, /var/boot_count_b)
# Missing or empty count files are treated as 0
count_a=$(cat /var/boot_count_a 2>/dev/null || echo 0)
count_b=$(cat /var/boot_count_b 2>/dev/null || echo 0)
count_a=${count_a:-0}
count_b=${count_b:-0}
if [ "$count_a" -gt "$count_b" ]; then
  count=$count_a
else
  count=$count_b
fi

count=$(($count+1))

# Write out the new boot count (do files that don't exist first)
if [ ! -e /var/boot_count_a ]; then
  # a doesn't exist, write it first
  echo $count > /var/boot_count_a
  sync
  echo $count > /var/boot_count_b
  sync
else
  echo $count > /var/boot_count_b
  sync
  echo $count > /var/boot_count_a
  sync
fi
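
One thing I may try is the usual write-then-rename pattern, so that a
power cut leaves either the old count or the new count instead of a
truncated file. A rough, untested sketch (temp-file names are just
illustrative):

# Write the new count to a temp file, flush it to flash, then
# rename() it over the old file (the rename itself is atomic)
echo $count > /var/boot_count_a.tmp
sync
mv /var/boot_count_a.tmp /var/boot_count_a
sync
echo $count > /var/boot_count_b.tmp
sync
mv /var/boot_count_b.tmp /var/boot_count_b
sync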
 

-Eric


