CONFIG_MTD_NAND_VERIFY_WRITE with Software ECC

Fri Feb 25 09:44:10 EST 2011

On Fri, Feb 25, 2011 at 12:12:10PM +0000, Artem Bityutskiy wrote:
> On Fri, 2011-02-25 at 12:36 +0100, Ivan Djelic wrote:
(...)
> > 
> > The fact that a bitflip detected during torture is enough to decide that a
> > block is bad causes problems on some 4-bit ecc devices we are using. If we
> > stick to this policy, we end up with a _lot_ of blocks being marked as bad
> > (i.e. way too many).
> 
> I see. Maybe in your case 1 bit errors are completely harmless, but 2
> and 3 are not?

When a NAND device requires 4-bit ecc or more, you do see a lot of 1-bit errors
(compared to previous NAND devices). They are not "completely harmless" because
you are still supposed to relocate the data to some other block and erase the
block (those bitflips are reversible errors), in order to avoid error
accumulation and stay below the specified ecc requirement. But they probably
should not be considered an indication that the block has gone bad.

> > Our NAND manufacturer tells us that, as long as a block erase operation
> > completes without a failure reported by the device, it should not be classified
> > as bad, even if it has bitflips (which sounds risky at best).
> 
> For any amount of flipped bits per page? Sounds a bit scary.

I agree. Our NAND manufacturer even told us that a single permanent 1-bit
failure in a block is not enough for marking this block as bad on 4-bit ecc NAND
devices. I still think there should be a specified amount of errors above which
the block should be considered bad. Maybe only permanent bit failures should
be considered.

Just for information, in our case: 1-bit and 2-bit errors are not reported,
3-bit and above are reported. And we are able to correct up to 8 errors (while
the device only requires 4-bit correction), so we have some kind of safety
margin.
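
To make the numbers above concrete, here is a minimal sketch of the
driver-side filtering described in this paragraph. It is not our actual
driver code: the function name and both constants are hypothetical and
only mirror the figures quoted above (correction of up to 8 bits,
reporting from 3 bits upward).

```c
/* Hypothetical values mirroring the figures quoted in this mail. */
#define ECC_STRENGTH      8   /* max bit errors the ECC can correct per step */
#define REPORT_THRESHOLD  3   /* smallest corrected count reported upward */

/*
 * Decide what to report for one ECC step.
 *   corrected: number of bit errors found in the step
 * Returns -1 if the step is uncorrectable, 0 if the errors are
 * corrected silently (below the threshold), or the corrected count.
 */
static int reported_bitflips(int corrected)
{
	if (corrected < 0 || corrected > ECC_STRENGTH)
		return -1;              /* beyond what the ECC can fix */
	if (corrected < REPORT_THRESHOLD)
		return 0;               /* 1-bit and 2-bit errors are hidden */
	return corrected;               /* 3 bits and above are reported */
}
```

With these assumed values, counts of 1 and 2 are hidden from the upper
layers, while counts from 3 to 8 are passed through unchanged.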

> > Right now, we implement a bitflip threshold, below which we correct ecc errors
> > without reporting them. When the bitflip threshold is reached, we report the
> > amount of corrected errors, triggering block scrubbing, etc.
> > This is not ideal, but it prevents UBI from torturing and marking too many
> > blocks as bad.
> 
> Hmm... Working around UBI behavior does not sound like the best
> solution.

Agreed.

> How about changing the MTD interface a little and teaching it to:
> 
> 1. Report the bit-flip level (or you name it properly) - the amount of
> bits flipped in this NAND page (or sub-page). If we read more than one
> NAND page in one go, and several pages had bit-flips of different level,
> report the maximum.

Yes, we do need the maximum error count per subpage. Today we only have a
cumulative count.
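
The difference between the two counts can be sketched as follows. This
is an illustration only, not the MTD API: the structure and function
names are hypothetical, and the per-subpage counts are assumed to be
available as a plain array.

```c
#include <stddef.h>

/* Hypothetical summary of corrected bit errors over one read. */
struct flip_summary {
	unsigned int total;   /* cumulative count over all subpages */
	unsigned int max;     /* worst single subpage */
};

/* Summarize per-subpage corrected counts from a multi-subpage read. */
static struct flip_summary summarize_bitflips(const unsigned int *per_subpage,
					      size_t n)
{
	struct flip_summary s = { 0, 0 };

	for (size_t i = 0; i < n; i++) {
		s.total += per_subpage[i];
		if (per_subpage[i] > s.max)
			s.max = per_subpage[i];
	}
	return s;
}
```

For example, counts of {1, 1, 1, 1} and {4, 0, 0, 0} both give a
cumulative count of 4, but the maximum is 1 in the first case and 4 in
the second. Only the second one is at the limit of a 4-bit ecc.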

> 2. Make it possible for drivers to set the "bit-flip tolerance
> threshold" (invent a better name please), which is the lowest bit-flip
> level which should be considered harmful. E.g., in your case, the
> threshold could be 2.

This kind of threshold is NAND-device specific, ideally it could be derived
from ONFI + manufacturer information, in a driver-independent way. Ideally...

> 3. Make UBI only react on bit-flips with order higher than or equal to
> the threshold. In your case then, UBI would ignore all level 1 bit-flips
> and react only to level 2, 3, and 4 bit-flips.

Yes, this makes sense. When you say "react", do you also mean not doing any
scrubbing when the error count is below the threshold?
In future devices, 1-bit errors will become very common and we'll probably need
to ignore them to avoid scrubbing blocks all the time.
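
A minimal sketch of the policy discussed in point 3, assuming the answer
to the question above is "yes". This is not existing UBI code: the enum,
the function name and the way the threshold is passed in are all
hypothetical.

```c
/* Hypothetical actions the upper layer could take after a read. */
enum flip_action {
	FLIP_IGNORE,   /* below the threshold: no scrubbing at all */
	FLIP_SCRUB     /* at or above the threshold: relocate and erase */
};

/*
 * max_bitflips: maximum corrected count per subpage for this read
 * threshold:    lowest bit-flip level considered harmful
 */
static enum flip_action classify_bitflips(unsigned int max_bitflips,
					  unsigned int threshold)
{
	return max_bitflips >= threshold ? FLIP_SCRUB : FLIP_IGNORE;
}
```

With a threshold of 2, as in the example from point 2, a level 1
bit-flip is ignored and levels 2, 3 and 4 trigger scrubbing.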

BR,

Ivan