ubi on MLC nand flash

Sun Nov 6 12:35:28 EST 2011

On Sun, Nov 06, 2011 at 03:24:24PM +0000, Mike Dunn wrote:
> Hi everyone,
> 
> I recently started to do serious testing of UBI on the diskonchip G4 MLC nand
> driver I'm finishing up.  I started with the io_basic ubi test in mtd-utils. 
> What I find is that, after a few minutes, enough PEBs are marked as bad to
> exhaust the reserve PEB pool, UBI switches to r/o mode, and the test fails.  The
> reason is that - on this device at least - bit flips seem to be persistent;
> i.e., you will get e.g. 1 bit flip every time you read a certain page. 
> Consequently, when the bit flip occurs and the PEB gets scrubbed, the torture
> test fails because the bit flip reoccurs, and the PEB is marked bad.

Hi Mike,
I had the same results on recent (34 nm) SLC devices.

> I expected that eventually I might have to dig into the "program disturb",
> "read-disturb" or "paired pages" MLC issues, but the problem seems more
> fundamental.  My general impression is that UBI is too unforgiving for this
> device.  The ecc can correct up to 4 bit flips, so 1 bit flip seems to not be a
> big deal.  I'm new to UBI so this is not a critique or a proposal, I'm just
> hoping some experts can offer some advice or opinions.  The obvious remedy is to
> set a higher threshold for marking a PEB as bad, say 2 or 3 bit flips.

I discussed the matter with a nand manufacturer a while ago; the information I
could get (for SLC devices, not MLC) can be summarized as follows:

1. A block should be marked bad if a number of bitflips greater than what ecc
is able to correct has been detected after erase/program; or if the operation
failed with a status error

2. If the maximum number of correctable bitflips is reached during a read
operation, data should be relocated to another block, without marking the block
as bad

I could not get definitive information about the handling of persistent
bitflips, apart from the fact that they are expected and should not cause a
block to be marked as bad (as long as the ecc capability is not exceeded).
Most nand datasheets I had in my hands are also vague on the subject; they lack
a precise error handling strategy description for multi-bitflip devices.

Point 2 above seems reasonable as long as bitflips are reversible (i.e.
cancelled by an erase operation); but what if the maximum number of correctable
errors is reached during a read, those errors being caused by persistent
bitflips ? Should the block be considered bad (IMHO it should be scrubbed then
marked bad), or should data be simply relocated ?
When I asked the latter question to a nand manufacturer, his recommendation
was (quoting):
"(...) not to mark the block bad (because the error is correctable), and
to keep a copy of critical data in another location as backup" (!).

I suggest the following strategy:

Upon reading, when errors are detected (and corrected by ecc):
 - if (nb of errors <  ecc capability (*)) then no scrubbing, do nothing
 - if (nb of errors == ecc capability (*)) then
    - scrub block, then torture it and compute nb of persistent bitflips
    - if (nb of persistent errors <  ecc capability (*)) then block is OK
    - if (nb of persistent errors == ecc capability (*)) then mark block as bad
      [because a single additional bitflip (e.g. a read disturb) would cause
      data loss]

(*) In order to improve reliability, thresholds can be used instead of max ecc
capability.

I'm interested to hear opinions from mtd users/nand experts on the subject; I
know that at least a few of us had to implement ecc thresholds recently. And
UBI/mtd should be modified to support this (IIRC Artem was pushing in that
direction a while ago).

BR,
--
Ivan