[PATCH v3 0/6] NAND BBM + BBT updates
Ricard Wanderlof
ricard.wanderlof at axis.com
Thu Jan 19 04:59:06 EST 2012
On Thu, 19 Jan 2012, Angus CLARK wrote:
> On 01/18/2012 10:04 PM, Brian Norris wrote:
>> On Tue, Jan 17, 2012 at 2:22 AM, Angus CLARK <angus.clark at st.com> wrote:
>>> (Indeed, this issue was raised recently in a meeting with one of the major NAND
>>> manufacturers, and the design engineer was horrified at the thought of relying
>>> on the OOB for tracking worn blocks.)
>>
>> That's interesting. I never had this impression, but perhaps the topic
>> just never came up.
>>
> Since it was first brought to our attention, we have sought clarification from a
> number of sources. The general consensus seems to be that if a block has gone
> bad, then one cannot rely on any further operations succeeding, including
> writing BB markers to the OOB area. However, the extent to which this is a
> problem in practice is less clear. Many of us have been using OOB BB markers
> for years without any issue, although perhaps we just haven't noticed!
As far as I understand, a block going bad during ordinary operation
basically means it is worn out to the point that the on-flash write
algorithm fails and responds with an error. So yes, in principle that
means writes will fail and the OOB cannot be written. On the other hand,
a reported write error really means that the chip has not managed to
reliably write all bits; it seems unlikely that all 8 bits of the bad
block marker byte in the OOB would fail to get written with zeros.
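To illustrate the point: the usual convention is that any non-0xff value
in the marker byte is taken to mean 'bad', so even a partial program of
the marker is sufficient. A minimal sketch (hypothetical names, not code
taken from any particular driver), assuming 'oob' already holds the spare
area and 'badblockpos' is the chip's marker offset:

  /* Returns non-zero if the OOB bad block marker says "bad".
   * Convention: anything other than 0xff marks the block as bad,
   * so a single programmed bit in the marker byte is enough. */
  static int oob_marker_says_bad(const unsigned char *oob, int badblockpos)
  {
          return oob[badblockpos] != 0xff;
  }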
We ran a test on a 32 MB flash many years ago to get some idea of what
happens when a block 'wears out'. In that test, it was the erase
operation that failed first, and even then it was not an either-or
situation; a block for which the chip reported an erase error could very
well be erased successfully later. Furthermore, the number of erase
cycles at that point was way above (by a factor of 20 or so) the
endurance spec for the chip - not surprising, since the spec by nature
must be conservative. What was more interesting was that the data
retention at that stage was utterly lousy; data written to another block
that had been cycled as many times as the one that had 'failed' had a
retention time in the region of hours before bits started flipping on
subsequent reads.
What it all boiled down to was that, for the failure mode we were seeing,
it was apparent that one cannot rely on the error status from the flash
to determine when a block is 'bad'; one must keep some form of erase
counter and proactively mark a block as bad once the number of erase
cycles has reached a predetermined value (i.e. the endurance spec for the
chip). Furthermore, it was apparent that a block going bad is primarily a
wearing-out process, not something where a block suddenly 'goes bad' and
is unusable after that.
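To make the erase-counter idea concrete, here is a rough sketch of such a
policy (hypothetical names and threshold, not code from any existing
driver or FTL), assuming the wear-levelling layer keeps a per-block erase
count:

  /* Hypothetical per-block bookkeeping kept by the wear-levelling layer. */
  struct block_stats {
          unsigned int erase_count;       /* erase cycles performed so far */
  };

  #define ENDURANCE_SPEC  100000          /* rated endurance from the datasheet */

  /* Retire the block once it reaches the rated endurance, regardless of
   * whether the chip has ever reported an erase or program failure. */
  static int block_should_be_retired(const struct block_stats *s)
  {
          return s->erase_count >= ENDURANCE_SPEC;
  }

In practice one would of course also retire a block on a reported
erase/program failure; the counter just makes sure a block does not
silently outlive its retention guarantees.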
Of course, there are most likely other failure modes that we have not
observed, and this was also a couple of years ago; shrinking geometries
have affected the behaviour of flash chips since then.
It would be interesting if anyone else has any hard data on this; the
flash manufacturers are usually not very forthcoming.
/Ricard
--
Ricard Wolf Wanderlöf ricardw(at)axis.com
Axis Communications AB, Lund, Sweden www.axis.com
Phone +46 46 272 2016 Fax +46 46 13 61 30