UBI wear leveling / torture testing algorithms having trouble with MLC flash

Wed Apr 21 12:51:42 EDT 2010

Hi Artem,

> > The permanently stuck bits I have often seen with MLC are typically stuck at 0 due to
> > neighbouring program disturb effects from other cells. These same stuck bits are the
> > reason for the torture test looping. If you look at the logs of the corrections,
> > it typically involves resetting these back to 1. Perhaps the nand driver can recognize
> > a stuck bit somehow and not report it as a correction but I think that's not right.
>
> Hm, to me this sounds like an idea which may work. What if we just change
> the semantics of the -EBADMSG error code of the MTD subsystem. Change it
> from "bit flips occurred and were corrected" to "_dangerous_ bit flips
> occurred, were corrected, but it is risky, so the data should be
> refreshed".
>
> Then for SLC, every bit-flip can cause -EBADMSG, just like now, and for
> MLC - the driver will be able to decide.
>
> The idea is that MLCs are so different that only the driver can know
> whether a bit-flip is dangerous or not.

When bit flips are corrected we return -EUCLEAN and when uncorrectable errors occur we
return -EBADMSG. So you probably mean -EUCLEAN above? I think -EBADMSG should continue
to mean "uncorrectable".
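
To make the distinction concrete, here is a minimal sketch in C of how a caller might
map the two codes to actions. The names classify_read and enum read_action are made up
for illustration and are not MTD or UBI APIs; the EUCLEAN fallback value is only an
assumption for C libraries that do not define it.

```c
#include <errno.h>

/* EUCLEAN is Linux-specific; 117 is the x86 Linux value, used here only as a
 * fallback so the sketch builds where the libc does not define it. */
#ifndef EUCLEAN
#define EUCLEAN 117
#endif

/* Hypothetical actions an upper layer could take after a read. */
enum read_action {
    READ_OK,     /* data is good, nothing to do */
    READ_SCRUB,  /* data is good after correction; move it to another block */
    READ_FAILED  /* data could not be recovered, or some other I/O error */
};

/* Hypothetical helper: map a read return code to an action. */
static enum read_action classify_read(int err)
{
    if (err == 0)
        return READ_OK;
    if (err == -EUCLEAN)
        return READ_SCRUB;   /* bit flips occurred and were corrected */
    return READ_FAILED;      /* -EBADMSG (uncorrectable) and anything else */
}
```

With this mapping, classify_read(-EUCLEAN) asks for a scrub while classify_read(-EBADMSG)
is treated as data loss, which is the split I would like to keep.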

But a permanently stuck bit is corrected on each read and isn't really dangerous.
It's most likely just a write-disturb effect. I have seen that blocks that get heavily
erased & programmed start showing lots of correctable ECC errors. So I don't think we
should consider stuck bits or random bit flips as dangerous on their own. But we should
consider a high number of errors of both types together as a dangerous condition, likely
indicating block wearout.

>
> > Also the programming during torturing might create even more write disturb errors
> > and make the problem even worse. But we have to live with that anyway during regular
> > programming for application data so that's probably a moot point.
>
> This is another problem. We can disable torturing for MLC, or change
> it.

Torturing MLC flash just wears it out more and creates more bit flips. But scrubbing
and torturing are different things. We scrub to increase the probability that precious
user data is not lost, so disabling scrubbing is not a good option, but disabling or
changing torturing may be fine. Perhaps we should have MTD/MLC return -EUCLEAN when,
say, 80-100% of the maximum possible corrections are done, then scrub the data to a good
block, and then either torture the marginal block or mark it bad and remove it from
service?
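
A rough sketch of that threshold in C, under stated assumptions: mlc_read_status and
SCRUB_THRESHOLD_PCT are hypothetical names, the 80% figure is just the low end of the
range suggested above, and the ECC engine is assumed to report either "uncorrectable"
or a count of corrected bits.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Fallback for C libraries without EUCLEAN (117 is the x86 Linux value). */
#ifndef EUCLEAN
#define EUCLEAN 117
#endif

/* Assumed policy knob: percentage of ECC strength that triggers scrubbing. */
#define SCRUB_THRESHOLD_PCT 80

/*
 * Hypothetical driver-side helper.
 *   uncorrectable: the ECC engine gave up on this read
 *   corrected:     number of bit flips it fixed
 *   ecc_strength:  maximum number of bit flips it can fix
 * Returns 0, -EUCLEAN (corrected, but close to the limit) or -EBADMSG.
 */
static int mlc_read_status(bool uncorrectable, unsigned int corrected,
                           unsigned int ecc_strength)
{
    assert(ecc_strength > 0);

    if (uncorrectable)
        return -EBADMSG;
    /* Integer form of: corrected / ecc_strength >= threshold / 100 */
    if (corrected * 100 >= ecc_strength * SCRUB_THRESHOLD_PCT)
        return -EUCLEAN;
    return 0;  /* few corrections: not worth reporting */
}
```

With a 12-bit ECC, 9 corrected bits would be reported as a clean read and 10 or more as
-EUCLEAN. With a 1-bit ECC every corrected flip still yields -EUCLEAN, so SLC behaviour
would stay as it is today.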

Another idea might be to keep a torture histogram that records how many times each block
was sent out for torturing. After N (3?) tortures, something is definitely wrong with the
block, and it could then be marked bad permanently.

For MLC, you might consider just erasing the block and not torturing at all. Eventually the
marginal block will hit the N threshold above and be taken out of service anyway. Then you
don't need an MTD-specific torture test at all...
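
The counting idea could look roughly like this in C. Everything here is hypothetical:
struct block_state, handle_marginal_block and MAX_TORTURES are illustration names, N = 3
is only the guess from above, and a real implementation would also have to keep the
counter somewhere persistent, which this sketch ignores.

```c
#include <stdbool.h>

/* Assumed value of N: how often a block may be sent out before it is retired. */
#define MAX_TORTURES 3

/* Hypothetical per-eraseblock bookkeeping (one histogram bucket per block). */
struct block_state {
    unsigned int torture_count;  /* times this block was sent for torturing */
    bool bad;                    /* permanently removed from service */
};

/*
 * Called when a block is found marginal, after its data has been scrubbed to
 * a good block. Returns true if the block stays in service, false if it is
 * (or already was) marked bad.
 */
static bool handle_marginal_block(struct block_state *b)
{
    if (b->bad)
        return false;

    b->torture_count++;
    if (b->torture_count >= MAX_TORTURES) {
        b->bad = true;           /* hit N: retire the block for good */
        return false;
    }

    /* For MLC: a plain erase would go here instead of a full torture test. */
    return true;
}
```

A block that keeps coming back is erased and reused twice and retired on the third
visit, without ever running a flash-specific torture pattern.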

>
>
>
> In general, I think all the MLC-specific things like "ECC 12" should be
> hidden in the MTD level. Just because this information is too
> MLC-specific for UBI to know about it.
>
> UBI should distinguish MLC and behave a bit differently in that case, but
> this should be about "torture MLC eraseblocks this special way", and the
> like. IOW, on higher level than knowing about ECC levels.

I agree the MTD driver should hide this information if possible, and we suggested that
above for error reporting. But for torturing, that might mean we need to change the MTD
interface to be able to request a flash-specific torture test and return a status.
Changing MTD seems harder than the suggestions above, and since torturing may make
things worse, I think we should try to keep our solution in the UBI layer for now.

Thanks.

Best regards,
Darwin
