ubi vol_size and lots of bad blocks

Daniel Drake dsd at laptop.org
Mon Oct 10 08:09:19 EDT 2011


Hi,

We're still working on getting ubifs shipped on OLPC XO-1.

One outstanding issue we have is that on some laptops, when switching
from jffs2 to ubifs, the laptop simply does not boot (root fs mounting
difficulties).

One case of this occurs when there are a large number of bad blocks on
the disk; during boot we get:
[   76.855427] UBI error: vtbl_check: too large reserved_pebs 7850, good PEBs 7765
[   76.867878] UBI error: vtbl_check: volume table check failed: record 0, error 9

With so many bad blocks, this is likely a problematic NAND or a
corrupt BBT. However, jffs2 worked in this situation, and (with many
of our laptops in remote places) it would be nice for us to figure out
how to make ubifs handle it as well.


There are other cases of this error in the archive, and people have
generally solved it by using a smaller vol_size in the ubinize config.
Am I right in saying that reserved_pebs is computed from the vol_size
specified in the ubinize config?

I guess "good PEBs" is calculated from the amount of non-bad blocks
found during the boot process.
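
For concreteness, the kind of ubinize.cfg I mean looks roughly like
this (the section name, image path and size are placeholders, not our
real values):

  [rootfs]
  mode=ubi
  image=rootfs.ubifs
  vol_id=0
  vol_type=dynamic
  vol_name=root
  vol_size=900MiB

If I've understood correctly, it is the vol_size line here that ends up
(via the LEB size) as reserved_pebs in the volume table, independently
of how many PEBs actually turn out to be good.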

This suggests that specifying vol_size is unsafe for installations such
as ours: although we know the NAND size in advance, we also need to
tolerate an unknown, potentially high number of bad blocks, which will
vary from laptop to laptop in the field.

I found a note in the UBI FAQ saying that vol_size can be omitted, in
which case it is computed from the size of the input image, and that
the autoresize flag can then be used to expand the volume later.
Omitting vol_size in this way does indeed solve the problem, and the
problematic laptop now boots.
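
In other words, something along these lines (again, section name and
image path are placeholders):

  [rootfs]
  mode=ubi
  image=rootfs.ubifs
  vol_id=0
  vol_type=dynamic
  vol_name=root
  vol_flags=autoresize

With no vol_size line, ubinize sizes the volume to fit the input image,
and the autoresize flag lets it grow to the available space afterwards.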

So, am I right in saying that for an installation such as OLPC, where
resilience to strange NAND conditions involving high numbers of bad
blocks is desired, it is advisable to *not* specify vol_size in
ubinize.cfg?

(If so I'll send in a FAQ update for the website.)

The one bit I don't understand is what happens if another block goes
bad later. If the autoresize functionality has modified reserved_pebs
to represent the exact number of good blocks on the disk (i.e.
reserved_pebs==good_PEBs), next time a block goes bad the same
reserved_pebs>good_PEBs boot failure would be hit again. But I am
probably missing something.
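
To put numbers on it: if, after autoresize, reserved_pebs were 7765
(equal to today's good PEB count), then a single additional bad block
would leave 7764 good PEBs, and reserved_pebs (7765) > good PEBs (7764)
looks like exactly the vtbl_check failure quoted above.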

cheers,
Daniel
