UBI wear leveling / torture testing algorithms having trouble with MLC flash

Thu Apr 15 14:01:45 EDT 2010

Hi Artem,

> >  What's maybe 
> > needed is a torture test that understands the geometry and the bit correction 
> > level 1,4,8,12 etc and is able to toggle geometrically neighboring bits rather 
> > than view it as a simple memory test, but this may be difficult.
> 
> May be, but then I think this function should be moved down from UBI to
> the MTD level. Then the driver-level information like geometry may be
> used.

Do you mean having a low-level MTD function that does its own type of torture 
test on a block and returns the test status to UBI?

> >  Perhaps 
> > torturing MLC blocks with only 3000 cycles is inappropriate anyways, why not 
> > just trust the error correction and only mark uncorrectable blocks as bad?.
> 
> May be. But on the other hand, ignoring soft-errors completely is not
> very good, as they may develop into hard-errors. Probably, as usually,
> we need a balance.

By MLC hard-errors, do you mean uncorrectable sectors, or do you mean permanently 
stuck bits that are consistently corrected by ECC every read? 

The permanently stuck bits I have often seen with MLC are typically stuck at 0 due to 
program disturb effects from neighbouring cells. These same stuck bits are the 
reason for the torture test looping. If you look at the logs of the corrections, 
they typically involve resetting these bits back to 1. Perhaps the NAND driver could 
recognize a stuck bit somehow and not report it as a correction, but I think that's not right.

(Side-note: I also sometimes see a bit set to 0 after an erase operation with MLC.
E.g. you might see 0xfffffdff in the erased page. 
Since the ECC is also erased, and FF data does not imply FF ECC in our case, it is 
necessary to detect the all-FF erased ECC and, in software, report these stuck-low data 
bits as high, as if the ECC had been present and corrected them... but I digress).
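
To make the side-note concrete, here is a minimal sketch of that erased-page fixup. It is not
our actual driver code: the names fixup_erased_page() and count_zero_bits() and the max_stuck
limit are made up for illustration, and a real driver would do this inside its ECC correction path.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count the bits that read back as 0 in a buffer. */
static unsigned int count_zero_bits(const uint8_t *buf, size_t len)
{
	unsigned int zeros = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		uint8_t v = (uint8_t)~buf[i];	/* zero bits become one bits */

		while (v) {
			v &= (uint8_t)(v - 1);	/* clear the lowest set bit */
			zeros++;
		}
	}
	return zeros;
}

/*
 * Hypothetical helper: if the ECC bytes are all 0xFF (erased) and the data
 * area has no more than max_stuck zero bits, treat the page as erased and
 * patch the data back to all 0xFF. Returns 1 if the page was treated as
 * erased, 0 if it must go through normal ECC handling.
 */
static int fixup_erased_page(uint8_t *data, size_t data_len,
			     const uint8_t *ecc, size_t ecc_len,
			     unsigned int max_stuck)
{
	if (count_zero_bits(ecc, ecc_len) != 0)
		return 0;	/* ECC was programmed, not an erased page */
	if (count_zero_bits(data, data_len) > max_stuck)
		return 0;	/* too many zero bits to be stuck-bit noise */

	memset(data, 0xff, data_len);
	return 1;
}
```

The max_stuck limit is an assumption; presumably it should not exceed what the ECC would
have corrected had it been present.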

In the case of MLC, I think we might ignore soft errors and let the ECC do its job. 
If the errors are program disturb effects as above, then they will be persistent and 
will always be corrected within the ECC capability. If we get too many stuck bits, then 
the sector/page/block becomes bad, and we know that without scrubbing or torturing 
anyway. If they are random errors due to noise margins and read-disturb effects, then 
I believe that this is also 'normal' behavior and shouldn't trigger error recovery
operations like scrubbing/torturing, especially for a single-bit correction.

Also, the programming during torturing might create even more write disturb errors 
and make the problem worse. But we have to live with that anyway during regular
programming of application data, so that's probably a moot point.

> >  If I have a 
> > 3000 erase MLC part, then only 750 quick loops of the scrub/torture (4 erase) cycle 
> > will wear the block out.
> 
> This suggests you did not change the default UBI_WL_THRESHOLD = 4096.
> You should set it to something smaller. This will make the problem less
> severe, but will not fix it, of course.

So that explains why the loop eventually did stop. I originally had the threshold at
4096 but later switched to 256. In both cases we saw torture testing loops.

> How about improving UBI a little and just teach it avoid doing any
> scrubbing for eraseblocks with high enough erase-counter? Say, if UBI
> notices a bit-flip in eraseblock A, then:
> 
> if (EC of eraseblock A < min. EC + WL_FREE_MAX_DIFF / 2)
> 	do_scrubbing();
> else
> 	/* Do not do scrubbing for relatively "fresh" eraseblocks */
> 
> or something like that. This could be good enough to start with.
> 
> Also, torturing can be disabled or improved for MLC. This depends on how
> much efforts you want to invest into UBI over MLC.

Initially I think we might consider something like a config flag 
(e.g. CONFIG_MTD_UBI_SCRUB_AND_TORTURE) to just shut off the sensitivity 
to corrected errors, which are due to noise or the persistent stuck 
bits I described above. Rather than deciding whether to scrub 
based on the erase counter as you suggest above, another way to look at it might 
be to simply not start the scrubbing operation on normal ECC corrections. 
That's effectively what I'm doing today by hiding all ECC corrections, i.e. 
reporting them as 0 corrections.

Later, we might look into starting the scrubbing when the number of corrections 
reaches a threshold based on the NAND ECC correction limit. E.g. set 
CONFIG_MTD_UBI_SCRUB_AND_TORTURE=y and also add "CONFIG_MTD_UBI_ECC_CAPABILITY=12", 
and then when we get close to 12 corrections (10,11,12?), we scrub, and perhaps 
torture this _much_ more marginal block, maybe take it out of service, etc. This 
could be combined with your erase-counter suggestion above.
MLC parts could set the capability to 4, 8, 12, etc. to match their ECC. If it was 
not set, then the legacy behavior of UBI could still be used.
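
Roughly what I have in mind, as a sketch only. None of these names exist in UBI today:
classify_bitflips(), ec_allows_scrub() and the margin parameter are hypothetical, and the
split between "scrub" at 10-11 and "torture" at 12 is just one possible policy. An
ecc_capability of 0 stands in for the option not being set.

```c
#include <stdbool.h>

enum bitflip_action { BITFLIP_IGNORE, BITFLIP_SCRUB, BITFLIP_TORTURE };

/*
 * Decide what to do with a read that needed 'corrected' bit corrections.
 * ecc_capability: max correctable bits per ECC step (0 = not configured).
 * margin: how close to the limit we get before scrubbing.
 */
static enum bitflip_action classify_bitflips(unsigned int corrected,
					     unsigned int ecc_capability,
					     unsigned int margin)
{
	if (corrected == 0)
		return BITFLIP_IGNORE;

	/* Legacy behavior: any corrected bit-flip schedules scrubbing. */
	if (ecc_capability == 0)
		return BITFLIP_SCRUB;

	/* At the ECC limit: the block is very marginal. */
	if (corrected >= ecc_capability)
		return BITFLIP_TORTURE;

	/* Close to the limit: move the data before it becomes uncorrectable. */
	if (corrected + margin >= ecc_capability)
		return BITFLIP_SCRUB;

	/* Ordinary MLC noise or stuck bits: let the ECC handle it. */
	return BITFLIP_IGNORE;
}

/* The erase-counter gate from the quoted suggestion, as a separate check. */
static bool ec_allows_scrub(unsigned long long ec, unsigned long long min_ec,
			    unsigned int wl_free_max_diff)
{
	return ec < min_ec + wl_free_max_diff / 2;
}
```

With ecc_capability=12 and margin=2, reads with 1 to 9 corrections are ignored, 10 or 11
trigger a scrub, and 12 marks the block for torture. The scrub would then only be
scheduled if ec_allows_scrub() also agrees.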

Future work might involve more elegant MTD-based torture tests that understand 
flash geometry, but I think that's a much tougher nut to crack, especially since
there's no industry standard for MLC page/bit pairing algorithms across manufacturers.

These suggestions reduce the block erase frequency dramatically, and with 
3000 measly MLC erases, I don't see a problem restricting scrub/torture operations 
as a trade off for part lifetime.

Thanks.

Best regards,
Darwin
