RFC: detect and manage power cut on MLC NAND

Iwo Mergler Iwo.Mergler at netcommwireless.com
Mon Mar 23 18:17:34 PDT 2015


Hi Jeff,


thanks for the info. That sounds like very good news
to me - it appears that the paired page problem isn't
as bad as I thought.

> Lvl    LH
> ===========
> 1.0 => 11
> 0.7 => 01
> 0.3 => 10
> 0.0 => 00
         ||
Is this  HL?

This seems to match my second scenario (large step
for low page) - we do not lose the low page when
the high page write fails. Here is what I understood:

The low page distribution centres around 1.0 and 0.3,
with a spread of 0.7. Reading the low page as MLC at
this point will classify everything above 0.5 as 1
and below as 0.

So the distance of a '1' from the threshold is 0.5,
and the distance of a '0' is 0.2. It's asymmetric,
with SLC distance on one side and MLC on the other.

The high page write (only when writing a '0') moves
the charge towards 0.0, by 0.3. So if the low page
contains '1', the charge is lowered to 0.7; if the
low page contains '0', the charge goes down to 0.0.

At no time does the charge cross the 0.5 threshold.
An aborted high page write will have reduced the '1'
distance by 0.0-0.3 or increased the '0' distance
by 0.0-0.3.
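For what it's worth, the margin arithmetic above can be
sanity-checked in a few lines of Python. This is my own toy
model of the numbers in this thread, not silicon data:

```python
# Large-step-low-page scheme: '1' sits at 1.0, '0' at 0.3,
# and the low page reads as MLC against a 0.5 threshold.
THRESH = 0.5

def margin(level):
    """Distance of a cell level from the low page read threshold."""
    return abs(level - THRESH)

assert margin(1.0) == 0.5                    # '1': SLC-like distance
assert abs(margin(0.3) - 0.2) < 1e-9         # '0': MLC distance

# Worst-case aborted high page write subtracts the full 0.3:
assert abs(margin(1.0 - 0.3) - 0.2) < 1e-9   # '1' margin shrinks to 0.2
assert abs(margin(0.3 - 0.3) - 0.5) < 1e-9   # '0' margin grows to 0.5

print("margins never reach zero; the 0.5 threshold is never crossed")
```

So even the worst abort leaves at least the normal MLC margin of 0.2.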

Why are we worried about the low page at all? The way
I understand the gate charge changes, the low page
situation cannot, at any point, be worse than the
normal programmed MLC state.


Losing the high page isn't conceptually any worse
than aborted writes on SLC. This has been called
the "unstable bits problem".

As far as I understand, UBI/UBIFS currently don't
mitigate aborted writes. The worry here is that an
aborted write (SLC or MLC) gets you a page where the
0s are weak - they may pass ECC on read now but fail
next month. Or, if the abort came early, it may have
weakened some 1s in an erased-looking page.

Aborted erases are handled by UBI via the EC header
write immediately after erase. An erased block without
an EC header is erased again at boot time, to avoid
partially erased blocks (weak 1s).

Aborted writes are harder. UBI could deal with its
own headers via sequence number and always rewrite
the last block when it has UBI headers but no payload.
Seems a bit wasteful, though.

UBIFS also has an idea of which few blocks may have
received a power cut during the last write - I think
around 5 candidates or so. Rewriting them all at
every mount seems even more wasteful.

Then of course, the rewrite can also be aborted, so
UBI's last atomic LEB change probably also needs
redoing at attach time.


Jeff, what's the situation with aborted writes?
Is this problem real? I understand that there are
charge pumps on board the NAND chip - do they
store sufficient energy to complete a page write?

What about the typical system, which has, say, a
3.3V supply rail and a system reset firing when
that falls below 3.0V or so. The processor core
and NAND operate at 1.8V, so there are a few ms
between the last possible NAND command and the
breakdown of the NAND supply.

Would that be safe from aborted writes?


Best regards,

Iwo


________________________________________
From: Jeff Lauruhn (jlauruhn) [jlauruhn at micron.com]
Sent: Tuesday, 24 March 2015 8:15 AM
To: Iwo Mergler; Richard Weinberger; dedekind1 at gmail.com
Cc: Andrea Scian; mtd_mailinglist; Qi Wang 王起 (qiwang)
Subject: RE: RFC: detect and manage power cut on MLC NAND

This is a very simplified description, but actually it's more like this:

First pass, program the lower page.  If the lower page is 1, do nothing.  If the lower page is 0, subtract 0.7V, bringing the cell to 0.3.  The lower page is SLC-like: two distributions spread apart by 0.7V.

Lvl    LH
===========
1.0 => 1u
0.3 => 0u

Now, program the upper page.  First, read the lower page.  If the lower page is 1 and the upper page is 1, do nothing (11).  If the lower page is 1 and the upper page is 0, subtract 0.3 and call that 01.  Next, if the lower page is 0 and the upper page is 1, do nothing, and if the lower page is 0 and the upper page is 0, subtract 0.3V and call it 00.  Notice that the state of the lower page is on the right of 11, 01, 10, 00.

Lvl    LH
===========
1.0 => 11
0.7 => 01
0.3 => 10
0.0 => 00

Now what happens if there's a power loss during the programming of the upper page?  The upper page data will most likely be lost, and the lower page may be changed, but there's a good chance of recovery, because it will be in the range of SLC.  It is highly recommended to read and refresh data after a power loss.
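Jeff's two-pass description can be reduced to a toy model in a few
lines of Python (levels taken from the tables above; real devices
differ, and the tuples here are (lower, upper)):

```python
ERASED = 1.0

def program_lower(low_bit):
    # First pass: a 0 in the lower page subtracts 0.7, a 1 does nothing.
    return ERASED - (0.0 if low_bit else 0.7)

def program_upper(level, high_bit):
    # Second pass: a 0 in the upper page subtracts a further 0.3.
    # Jeff's description branches on the lower page value, but the
    # net charge move is the same 0.3 in either branch.
    return level - (0.0 if high_bit else 0.3)

table = {}
for low in (1, 0):
    for high in (1, 0):
        table[(low, high)] = round(program_upper(program_lower(low), high), 1)

print(table)
# {(1, 1): 1.0, (1, 0): 0.7, (0, 1): 0.3, (0, 0): 0.0}
```

The resulting levels match the 1.0/0.7/0.3/0.0 table above.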


Jeff Lauruhn
NAND Application Engineer
Embedded Business Unit

-----Original Message-----
From: Iwo Mergler [mailto:Iwo.Mergler at netcommwireless.com]
Sent: Sunday, March 22, 2015 9:09 PM
To: Richard Weinberger; dedekind1 at gmail.com
Cc: Andrea Scian; mtd_mailinglist; Jeff Lauruhn (jlauruhn); Qi Wang 王起 (qiwang)
Subject: RE: RFC: detect and manage power cut on MLC NAND


Hi all,


I probably don't know enough about the silicon implementation of MLC paired pages, but my feeling is that there should be a way to recover one of the pages if the paired write fails, at least in some cases.

Something along the lines of using both bits to map to a single good one.

2-bit MLC stores 4 levels - 1.0, 0.7, 0.3, 0.0. Obviously, the actual voltage levels will be somewhat different, so take this as electrons on the floating gate: 1.0 = minimum, 0.0 = maximum.

I imagine that there are two ways to achieve that - small step for low page and large step for high page, or the other way 'round.

Assuming the first, the low page write would subtract 0.3 from the erased (1.0) cell if the bit is 0. That leaves the cell at either ~1.0 (1) or 0.7 (0).

Lvl    LH
===========
1.0 => 1u
0.7 => 0u

Then, the high page write would subtract either nothing (1) or
0.7 (0):

Lvl    LH
===========
1.0 => 11
0.7 => 01
0.3 => 10
0.0 => 00

So the MLC decoder logic gets 3 priority encoded bits from the sense amplifiers: 111, 011, 001, 000. The decoder turns this into 11, 01, 10, 00.
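That decoder step can be sketched like this (the threshold values are
my guesses at midpoints between the four levels, purely illustrative):

```python
# Three sense amps compare the cell level against thresholds between
# adjacent states, giving the priority/thermometer code 111, 011,
# 001, 000, which a small lookup decodes into (low, high) bits.
THRESHOLDS = (0.85, 0.5, 0.15)   # assumed midpoints, not datasheet values

DECODE = {               # priority code -> (low, high)
    (1, 1, 1): (1, 1),
    (0, 1, 1): (0, 1),
    (0, 0, 1): (1, 0),
    (0, 0, 0): (0, 0),
}

def sense(level):
    return tuple(1 if level > t else 0 for t in THRESHOLDS)

for level, expect in ((1.0, (1, 1)), (0.7, (0, 1)),
                      (0.3, (1, 0)), (0.0, (0, 0))):
    assert DECODE[sense(level)] == expect

print("decoder matches the 11/01/10/00 table")
```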

The process of writing a 0 to the high page transitions low page 0-bits through 1 and back to 0, as the level moves down.

Low page 1 bits transition from 1 through 0 and back to 1.

So a half-completed high page 0-write can flip a low page bit both ways.

We can detect an incorrect 0-1 transition in the low page, because it's marked by a 0 bit in the high page.

We can't detect an incorrect 1-0 transition in the low page.

So assuming a failed high page write, this is what we get:

LH

11 = nothing happens, reads back as 11
     Correct level for both.

01 = Level stays at 0.7, reads back as 01.
     Correct level for low page.

10 = Level between 1.0 and 0.3, reads back as 11, 01 or 10.
     01 is wrong for low page, but can't be distinguished from 10.

00 = Level between 0.7 and 0.0, reads back as 01, 10, or 00.
     10 is wrong for low page, but can be distinguished from 01.

So, there are two bit combinations (50%) whose low page can suffer an unfixable failure (one of them undetectable), and this failure will happen about half the time, for a total of 25% unfixable failure rate.
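The 25% figure can be reproduced with a crude enumeration over the
small-step scheme above. Again a toy model of my own; it counts every
low page misread, detectable or not, by sampling the cell level
anywhere between the start and target of the high page write:

```python
# (low, high) -> programmed level, from the 11/01/10/00 table.
LEVELS = {(1, 1): 1.0, (0, 1): 0.7, (1, 0): 0.3, (0, 0): 0.0}

def decode_low(level):
    # Nearest programmed level wins; return its low page bit.
    pair = min(LEVELS, key=lambda p: abs(LEVELS[p] - level))
    return pair[0]

def low_page_loss_rate(samples=1000):
    bad = total = 0
    for (low, high), target in LEVELS.items():
        start = 1.0 if low else 0.7   # level after the small low page step
        for i in range(samples):
            level = start + (target - start) * i / samples
            total += 1
            if decode_low(level) != low:
                bad += 1
    return bad / total

print(round(low_page_loss_rate(), 2))  # prints 0.25 in this model
```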

Not acceptable in the general case, but might be good enough for things like UBI EC & VID headers, if we ensure that the high page contains 1s at the offsets at which the low page stores the header.


Now, on the other hand, if the low page write uses the larger step, there shouldn't be any paired page problem at all, since the high page write wouldn't cross the low page thresholds on the way:

Lvl    LH
===========
1.0 => 1u
0.3 => 0u

Lvl    LH
===========
1.0 => 11
0.7 => 10
0.3 => 01
0.0 => 00
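A quick enumeration over this large-step scheme (toy numbers of my
own, matching the tables above) confirms that no partially completed
high page write can flip a low page bit:

```python
# (low, high) -> programmed level, from the second table above.
LEVELS = {(1, 1): 1.0, (1, 0): 0.7, (0, 1): 0.3, (0, 0): 0.0}

def decode_low(level):
    # Low page reads as MLC against the single 0.5 threshold.
    return 1 if level > 0.5 else 0

def aborted_low_losses(samples=1000):
    bad = 0
    for (low, high), target in LEVELS.items():
        start = 1.0 if low else 0.3   # level after the large low page step
        for i in range(samples):
            level = start + (target - start) * i / samples
            if decode_low(level) != low:
                bad += 1
    return bad

print(aborted_low_losses())  # 0: the low page is never lost
```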


Which makes me think I'm misunderstanding something. If not, why isn't this scheme used in the first place?

What would happen if we reverse the paired page writing order?
Not recommended, we want pages programmed in sequence to mitigate disturbs and obtain the highest reliability.


Jeff, Qi, is the mechanism I described here anywhere near reality?
It's a simplified view, but fairly accurate.


Best regards,

Iwo


