[PATCH v3 0/6] NAND BBM + BBT updates
Ricard Wanderlof
ricard.wanderlof at axis.com
Thu Jan 19 04:59:06 EST 2012
On Thu, 19 Jan 2012, Angus CLARK wrote:
> On 01/18/2012 10:04 PM, Brian Norris wrote:
>> On Tue, Jan 17, 2012 at 2:22 AM, Angus CLARK <angus.clark at st.com> wrote:
>>> (Indeed, this issue was raised recently in a meeting with one of the major NAND
>>> manufacturers, and the design engineer was horrified at the thought of relying
>>> on the OOB for tracking worn blocks.)
>>
>> That's interesting. I never had this impression, but perhaps the topic
>> just never came up.
>>
> Since it was first brought to our attention, we have sought clarification from a
> number of sources. The general consensus seems to be that if a block has gone
> bad, then one cannot rely on any further operations succeeding, including
> writing BB markers to the OOB area. However, the extent to which this is a
> problem in practice is less clear. Many of us have been using OOB BB markers
> for years without any issue, although perhaps we just haven't noticed!
As far as I understand, a block going bad during ordinary operation
basically means it is worn out to the point that the on-flash write
algorithm fails and responds with an error. So yes, in principle that
means writes will fail and the OOB cannot be written. On the other hand,
a reported write error really means that the chip has not managed to
reliably write all bits; it seems unlikely that all 8 bits of the bad
block marker byte in the OOB would fail to get written with zeros.
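To illustrate the point: the usual convention is that any non-0xff value
in the marker byte is taken to mean 'bad', so even a partial program of
the marker is sufficient. A minimal sketch (hypothetical names, not code
taken from any particular driver), assuming 'oob' already holds the spare
area and 'badblockpos' is the chip's marker offset:

  /* Returns non-zero if the OOB bad block marker says "bad".
   * Convention: anything other than 0xff marks the block as bad,
   * so a single programmed bit in the marker byte is enough. */
  static int oob_marker_says_bad(const unsigned char *oob, int badblockpos)
  {
          return oob[badblockpos] != 0xff;
  }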
We ran a test on a 32 MB flash many years ago to get some idea of what
happens when a block 'wears out'. In that test, it was the erase
operation that failed first, and even then it was not an either-or
situation; a block for which the chip reported an erase error could very
well be erased successfully later. Furthermore, the number of erase
cycles at that point was way above (by a factor of 20 or so) the
endurance spec for the chip - not surprising, since the spec by nature
must be conservative. What was more interesting was that the data
retention at that stage was utterly lousy; data written to another block
that had been cycled as many times as the one that had 'failed' had a
retention time in the region of hours before bits started flipping on
subsequent reads.
What it all boiled down to was that, for the failure mode we were seeing,
it was apparent that one cannot rely on the error status from the flash
to determine when a block is 'bad'; one must keep some form of erase
counter and proactively mark a block as bad once the number of erase
cycles has reached a predetermined value (i.e. the endurance spec for the
chip). Furthermore, it was apparent that a block going bad is primarily a
wearing-out process, not something where a block suddenly 'goes bad' and
is unusable after that.
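To make the erase-counter idea concrete, here is a rough sketch of such a
policy (hypothetical names and threshold, not code from any existing
driver or FTL), assuming the wear-levelling layer keeps a per-block erase
count:

  /* Hypothetical per-block bookkeeping kept by the wear-levelling layer. */
  struct block_stats {
          unsigned int erase_count;       /* erase cycles performed so far */
  };

  #define ENDURANCE_SPEC  100000          /* rated endurance from the datasheet */

  /* Retire the block once it reaches the rated endurance, regardless of
   * whether the chip has ever reported an erase or program failure. */
  static int block_should_be_retired(const struct block_stats *s)
  {
          return s->erase_count >= ENDURANCE_SPEC;
  }

In practice one would of course also retire a block on a reported
erase/program failure; the counter just makes sure a block does not
silently outlive its retention guarantees.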
Of course, there are most likely other failure modes that we have not
observed, and this was also a couple of years ago; shrinking geometries
have affected the behaviour of flash chips since then.
It would be interesting if anyone else has any hard data on this; the
flash manufacturers are usually not very forthcoming.
/Ricard
--
Ricard Wolf Wanderlöf ricardw(at)axis.com
Axis Communications AB, Lund, Sweden www.axis.com
Phone +46 46 272 2016 Fax +46 46 13 61 30