CONFIG_MTD_NAND_VERIFY_WRITE with Software ECC

Fri Feb 25 07:12:10 EST 2011

On Fri, 2011-02-25 at 12:36 +0100, Ivan Djelic wrote:
> On Fri, Feb 25, 2011 at 10:29:22AM +0000, Artem Bityutskiy wrote:
> (...)
> > Currently the mechanism to mark a block as bad is the torture function
> > failure: we write a pattern, read it back, compare, and do this several
> > times with different patterns. In case of any error in any step, or if
> > we read back something we did not write, or even if we get a bit-flip
> > when we read back the data, we mark the eraseblock as bad. Otherwise it
> > is returned to the pool of free eraseblocks.
> > 
> > See torture_peb() in drivers/mtd/ubi/io.c
> > 
> > This procedure is not ideal, and could be improved:
> > 
> > a) we could store the number of times the eraseblock was tortured. Since we
> > torture only if there was a write error, too many torture sessions would
> > indicate that the eraseblock is unstable.
> > b) we could take into account the erase count somehow.
> > 
> > But yes, the threshold would probably be set up by the system designer at
> > the end.
> 
> The fact that a bitflip detected during torture is enough to decide that a
> block is bad causes problems on some 4-bit ecc devices we are using. If we
> stick to this policy, we end up with a _lot_ of blocks being marked as bad
> (i.e. way too many).

I see. Maybe in your case 1 bit errors are completely harmless, but 2
and 3 are not?

> Our NAND manufacturer tells us that, as long as a block erase operation
> completes without a failure reported by the device, it should not be classified
> as bad, even if it has bitflips (which sounds risky at best).

For any amount of flipped bits per page? Sounds a bit scary.

> Right now, we implement a bitflip threshold, below which we correct ecc errors
> without reporting them. When the bitflip threshold is reached, we report the
> amount of corrected errors, triggering block scrubbing, etc.
> This is not ideal, but it prevents UBI from torturing and marking too many
> blocks as bad.

Hmm... Working around UBI behavior does not sound like the best
solution.
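
Just to make sure I understand the workaround, here is roughly how I
read it, as a tiny standalone C sketch. The function name and both
parameters are made up for illustration; this is not existing MTD or
driver code, and the real driver presumably does this inside its ECC
correction path.

```c
/*
 * Hypothetical helper, not an existing MTD API.
 *
 * corrected: number of bit-flips the ECC engine corrected in this read
 * threshold: driver-private limit below which corrections are hidden
 *
 * Returns the number of corrected bit-flips to report upwards.
 * Returning 0 means the upper layers (UBI) never hear about the
 * correction, so no scrubbing or torturing is triggered.
 */
static unsigned int reported_bitflips(unsigned int corrected,
                                      unsigned int threshold)
{
    if (corrected >= threshold)
        return corrected;   /* threshold reached: report everything */
    return 0;               /* below threshold: correct silently */
}
```

With a threshold of 2, a read with one corrected bit-flip is reported
as clean, while a read with three is reported as three.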

How about changing the MTD interface a little and teaching it to:

1. Report the bit-flip level (or you name it properly) - the number of
bits flipped in this NAND page (or sub-page). If we read more than one
NAND page in one go, and several pages had bit-flips of different levels,
report the maximum.

2. Make it possible for drivers to set the "bit-flip tolerance
threshold" (invent a better name please), which is the lowest bit-flip
level which should be considered harmful. E.g., in your case, the
threshold could be 2.

3. Make UBI only react to bit-flips of a level higher than or equal to
the threshold. In your case then, UBI would ignore all level 1 bit-flips
and react only to level 2, 3, and 4 bit-flips.
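
To make the three points above a bit more concrete, here is a rough
sketch in plain standalone C. All names are invented for illustration
and nothing here is an existing MTD or UBI interface; it only shows the
arithmetic I have in mind: take the maximum level over the pages that
were read, and compare it with a driver-supplied threshold.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Point 1 (hypothetical): the bit-flip level of a multi-page read is
 * the maximum number of flipped bits seen in any single page.
 */
static unsigned int max_bitflip_level(const unsigned int *per_page,
                                      size_t npages)
{
    unsigned int max = 0;

    for (size_t i = 0; i < npages; i++)
        if (per_page[i] > max)
            max = per_page[i];
    return max;
}

/*
 * Points 2 and 3 (hypothetical): the driver sets the threshold, and
 * the upper layer reacts only when the level is at or above it.
 * A clean read (level 0) never needs action, whatever the threshold.
 */
static bool bitflips_need_action(unsigned int level,
                                 unsigned int threshold)
{
    return level > 0 && level >= threshold;
}
```

With a threshold of 2, a read whose pages had 0, 1, 3 and 1 flipped
bits would be reported at level 3 and UBI would react to it, while a
read where no page had more than 1 flipped bit would be ignored.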

Does this sound sensible?

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)