UBI wear leveling / torture testing algorithms having trouble with MLC flash

Wed Apr 21 09:13:33 EDT 2010

On Thu, 2010-04-15 at 11:01 -0700, Darwin Rambo wrote:
> Hi Artem,
>  
> > >  What's maybe 
> > > needed is a torture test that understands the geometry and the bit correction 
> > > level 1,4,8,12 etc and is able to toggle geometrically neighboring bits rather 
> > > than view it as a simple memory test, but this may be difficult.
> > 
> > Maybe, but then I think this function should be moved down from UBI to
> > the MTD level. Then the driver-level information like geometry may be
> > used.
> 
> Do you mean having a low level MTD function that does its own type of torture 
> test on a block and returns test status to UBI?

Something like that.

> > >  Perhaps 
> > > torturing MLC blocks with only 3000 cycles is inappropriate anyways, why not 
> > > just trust the error correction and only mark uncorrectable blocks as bad?
> > 
> > Maybe. But on the other hand, ignoring soft-errors completely is not
> > very good, as they may develop into hard-errors. Probably, as usual,
> > we need a balance.
> 
> By MLC hard-errors, do you mean uncorrectable sectors, or do you mean permanently 
> stuck bits that are consistently corrected by ECC every read? 

I actually meant uncorrectable ECC errors.

> The permanently stuck bits I have often seen with MLC are typically stuck at 0 due to 
> neighbouring program disturb effects from other cells. These same stuck bits are the 
> reason for the torture test looping. If you look at the logs of the corrections, 
> it typically involves resetting these back to 1. Perhaps the nand driver can recognize 
> a stuck bit somehow and not report it as a correction but I think that's not right.

Hm, to me this sounds like an idea which may work. What if we just change
the semantics of the -EUCLEAN error code of the MTD subsystem? Change it
from "bit flips occurred and were corrected" to "_dangerous_ bit flips
occurred, were corrected, but it is risky, so the data should be
refreshed".

Then for SLC, every bit-flip can cause -EUCLEAN, just like now, and for
MLC - the driver will be able to decide.

The idea is that MLCs are so different that only the driver can know
whether a bit-flip is dangerous or not.
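
A minimal sketch of what such a driver-side decision could look like,
assuming the driver knows how many bit-flips its ECC can fix per sector.
All names here (flip_verdict, classify_flips, margin) are hypothetical and
are not part of MTD; the "close to the ECC limit" policy is only one
possible definition of "dangerous".

```c
/* Hypothetical verdicts for one read; none of these names exist in MTD. */
enum flip_verdict {
	FLIP_NONE,          /* no bit-flips were corrected */
	FLIP_BENIGN,        /* corrected, no action needed */
	FLIP_DANGEROUS,     /* corrected, but the data should be refreshed */
	FLIP_UNCORRECTABLE  /* ECC could not repair the data */
};

/*
 * corrected:    bit-flips fixed by ECC in the worst sector of the read,
 *               or a negative value if ECC gave up
 * ecc_strength: maximum number of bit-flips ECC can fix per sector
 * margin:       how close to the limit counts as dangerous (assumed policy)
 */
static enum flip_verdict classify_flips(int corrected, int ecc_strength,
					int margin)
{
	if (corrected < 0 || corrected > ecc_strength)
		return FLIP_UNCORRECTABLE;
	if (corrected == 0)
		return FLIP_NONE;
	if (corrected + margin >= ecc_strength)
		return FLIP_DANGEROUS;
	return FLIP_BENIGN;
}
```

With ecc_strength = 1 and margin = 0 every corrected bit-flip is reported
as dangerous, which matches the SLC behaviour described above. With
ecc_strength = 12 and margin = 2, only reads needing 10 or more
corrections would be reported upwards.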

> Also the programming during torturing might create even more write disturb errors 
> and make the problem even worse. But we have to live with that anyways during regular
> programming for application data so that's probably a moot point.

This is another problem. We can disable torturing for MLC, or change
it.

> > >  If I have a 
> > > 3000 erase MLC part, then only 750 quick loops of the scrub/torture (4 erase) cycle 
> > > will wear the block out.
> > 
> > This suggests you did not change the default UBI_WL_THRESHOLD = 4096.
> > You should set it to something smaller. This will make the problem less
> > severe, but will not fix it, of course.
> 
> So that explains why the loop eventually did stop. I originally had the threshold at
> 4096 but later switched to 256. In both cases we saw torture testing loops.
> 
> > How about improving UBI a little and just teaching it to avoid doing any
> > scrubbing for eraseblocks with high enough erase-counter? Say, if UBI
> > notices a bit-flip in eraseblock A, then:
> > 
> > if (EC of eraseblock A < min. EC + WL_FREE_MAX_DIFF / 2)
> > 	do_scrubbing();
> > else
> > 	/* Do not do scrubbing for relatively "fresh" eraseblocks */
> > 
> > or something like that. This could be good enough to start with.
> > 
> > Also, torturing can be disabled or improved for MLC. This depends on how
> > much effort you want to invest into UBI over MLC.
> 
> Initially I think we might consider something like a config flag 
> (e.g. CONFIG_MTD_UBI_SCRUB_AND_TORTURE) to just shut off the sensitivity 
> to corrected errors, which are due to noise or the persistent stuck 
> bits I described above. Rather than decide to do scrubbing 
> based on the EC count as you suggest above, another way to look at it might 
> be to simply not start the scrubbing operation based on normal ECC corrections. 
> That's effectively what I'm doing today by hiding/reporting all ecc corrections 
> as 0 corrections.
> 
> Later, we might look into starting the scrubbing when the number of corrections 
> reaches a threshold based on the nand ECC correction limits. E.g. Set 
> CONFIG_MTD_UBI_SCRUB_AND_TORTURE=y and also add "CONFIG_MTD_UBI_ECC_CAPABILITY=12", 
> and then when we get close to 12 corrections (10,11,12?), we scrub, and perhaps 
> torture this _much_ more marginal block, maybe take it out of service, etc. This 
> could be combined with your erase-counter suggestion above.
> The MLC parts could increase it to 4,8,12, etc. If it was not set, then the legacy 
> behavior of UBI could still be used.
> 
> Future work might involve more elegant MTD based torture tests that understand 
> flash geometry but I think that's a much tougher nut to crack. Especially since
> there's no industry standard for MLC page/bit pairing algorithms by manufacturers.
> 
> These suggestions reduce the block erase frequency dramatically, and with 
> 3000 measly MLC erases, I don't see a problem restricting scrub/torture operations 
> as a trade off for part lifetime.

In general, I think all the MLC-specific things like "ECC 12" should be
hidden at the MTD level, because this information is too MLC-specific
for UBI to know about.

UBI should distinguish MLC and behave a bit differently in that case, but
this should be about "torture MLC eraseblocks this special way", and the
like. IOW, at a higher level than knowing about ECC levels.
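
To illustrate the split, here is a rough sketch of a UBI-side policy that
never looks at ECC levels. It uses the erase-counter condition quoted
earlier in this message and a per-device "is MLC" flag. The struct, the
function names and the is_mlc flag are hypothetical, not existing UBI code,
and wl_free_max_diff merely stands in for UBI's WL_FREE_MAX_DIFF.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of one eraseblock; not UBI's real struct. */
struct peb_info {
	unsigned int ec;   /* erase counter of this eraseblock */
	bool is_mlc;       /* assumed to be reported by the MTD layer */
};

/*
 * Decide whether a bit-flip report should trigger scrubbing.
 * min_ec:           lowest erase counter on the device
 * wl_free_max_diff: stands in for UBI's WL_FREE_MAX_DIFF
 */
static bool should_scrub(const struct peb_info *peb, unsigned int min_ec,
			 unsigned int wl_free_max_diff)
{
	/* Same condition as the quoted pseudocode. */
	return peb->ec < min_ec + wl_free_max_diff / 2;
}

/* In this sketch torturing is simply skipped for MLC eraseblocks. */
static bool should_torture(const struct peb_info *peb)
{
	return !peb->is_mlc;
}
```

Skipping torture for MLC is the simplest of the options mentioned above;
a special MLC torture routine could be hooked in at the same place.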

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)