CONFIG_MTD_NAND_VERIFY_WRITE with Software ECC
Ricard Wanderlof
ricard.wanderlof at axis.com
Thu Feb 17 05:04:57 EST 2011
On Tue, 15 Feb 2011, David Peverley wrote:
> I did some more Googling and found a couple more interesting articles
> ; I don't know if you've read them (I assume you probably have!) but
> thought I'd repost for anyone else interested as this seems to be a
> really fascinating and not often discussed topic! :
> http://www.broadbandreports.com/r0/download/1507743~59e7b9dda2c0e0a0f7ff119a7611c641/flash_mem_summit_jcooke_inconvenient_truths_nand.pdf
>
> I found the following also really interesting, especially for the
> analysis of lots of devices with a plot of numbers of initial bad
> blocks marked which I'd always wondered about! :
> http://trs-new.jpl.nasa.gov/dspace/bitstream/2014/40761/1/08-07.pdf
Thanks for the links. I have actually seen the first of these, but not the
second (which, however, seems in parts to draw heavily on the first one).
A couple of years ago I actually did some tests on a few 256 Mb SLC NAND
chips in order to better understand the failure modes. These were
specified as withstanding a maximum of 100 000 erase/program cycles.
Indeed, in one portion of the test, after 1.7 million erase/write cycles,
repeatedly reading the same block would result in uncorrectable ECC
failures (i.e. two-bit errors per 256 bytes) after 1.6 million read
cycles. This is, with some margin, in line with Cooke's rule of thumb that
at most 1 million read cycles should be performed before rewriting the
block (the block in question started exhibiting correctable (i.e. 1-bit)
errors after a mere 40 000 read cycles, but then again I was way over
spec in terms of erase/write cycles).
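For what it's worth, a read-disturb check of this kind can be done entirely
from user space by hammering one eraseblock with reads and watching the MTD
ECC counters via the ECCGETSTATS ioctl. The following is only a minimal
sketch along those lines, not my actual test code; the device node, block
offset and iteration count are made-up placeholders:

/* Minimal read-disturb sketch: repeatedly read one eraseblock through
 * /dev/mtdX and watch the MTD ECC counters.  The device node, offset
 * and iteration count below are placeholders, not values from my test. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <mtd/mtd-user.h>

int main(void)
{
	const char *dev = "/dev/mtd0";     /* assumed device node */
	const off_t block = 0x20000;       /* assumed offset of the block under test */
	unsigned char buf[2048];           /* one page; adjust to the chip's page size */
	struct mtd_ecc_stats before, after;
	int fd = open(dev, O_RDONLY);

	if (fd < 0 || ioctl(fd, ECCGETSTATS, &before) < 0) {
		perror("open/ECCGETSTATS");
		return 1;
	}

	for (long i = 0; i < 2000000; i++) {
		/* Depending on kernel version, an uncorrectable ECC error may
		 * also be reported here as an error from pread(); corrected
		 * bitflips only show up in the counters. */
		if (pread(fd, buf, sizeof(buf), block) < 0)
			perror("pread");

		if (ioctl(fd, ECCGETSTATS, &after) == 0 &&
		    (after.corrected != before.corrected ||
		     after.failed != before.failed)) {
			printf("read %ld: corrected=%u failed=%u\n",
			       i, after.corrected, after.failed);
			before = after;
		}
	}
	close(fd);
	return 0;
}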
One part of NASA's analysis confused me. They attempt to erase factory-marked
bad blocks and include them in the rest of the tests in order to
compare the characteristics of the known bad blocks with those of the rest
of the memory array.
However, data sheets often state that one should not attempt to write or
erase bad blocks (e.g. the Samsung K9F1G08X0B data sheet simply says 'Do
not erase or program factory-marked bad blocks.'), and I've read
application notes explaining that bad blocks may have failures that can
cause unreliable operation of the rest of the chip if an attempt is made to
use them. That is, bad blocks do not necessarily just contain bad bit cells;
there could be some other failure in the die that causes irregular
operation when, for instance, the internal erase voltages are applied to the
block.
Therefore, erasing bad blocks and including them in the test is basically
violating the device specification to start with.
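For completeness, honouring the factory marks from user space looks roughly
like the sketch below: check the bad block marker (MEMGETBADBLOCK) before
issuing MEMERASE, and leave marked blocks alone. This is only an
illustration; the device node is an assumption:

/* Sketch: erase an MTD partition while honouring the factory bad block
 * marks, i.e. skip any block that MEMGETBADBLOCK reports as bad.
 * Illustration only; /dev/mtd0 is an assumed device node. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <mtd/mtd-user.h>

int main(void)
{
	int fd = open("/dev/mtd0", O_RDWR);
	struct mtd_info_user info;

	if (fd < 0 || ioctl(fd, MEMGETINFO, &info) < 0) {
		perror("open/MEMGETINFO");
		return 1;
	}

	for (long long off = 0; off < info.size; off += info.erasesize) {
		/* MEMGETBADBLOCK returns > 0 for a block carrying a bad mark */
		if (ioctl(fd, MEMGETBADBLOCK, &off) > 0) {
			printf("skipping bad block at 0x%llx\n",
			       (unsigned long long)off);
			continue;
		}
		struct erase_info_user ei = {
			.start  = (__u32)off,
			.length = info.erasesize,
		};
		if (ioctl(fd, MEMERASE, &ei) < 0)
			perror("MEMERASE");
	}
	close(fd);
	return 0;
}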
The fact that the factory bad-block screening cannot be replicated by simple
testing is also well documented by flash manufacturers, either in data sheets
or application notes. I remember reading that in order to determine whether a
block is bad, a number of tests are run at the factory under much more
severe conditions than the average use case. I don't know whether this just
involves testing at abnormal voltages and temperatures, or whether the die is
also examined visually (or in some other non-electrical way) for defects
prior to being packaged, or whether undocumented chip access modes are used
during the tests.
Indeed, ST's application note AN1819, which you refer to (as sourced
from Micron; I think this is simply because Micron bought out Numonyx,
which had taken over ST's flash division, or however the exact economic
transactions went), mentions that "The invalid [i.e. bad] blocks are
detected at the factory during the testing process which involves severe
environmental conditions and program/erase cycles as well as proprietary
test modes. The failures that affect invalid blocks may not all be
recognized if methods different from those implemented in the factory are
used."
So as far as I'm concerned, it is well documented that it is not possible
for the user to determine, by testing, which blocks were marked bad at the
factory. Indeed, erasing or writing bad blocks could put the integrity of the
whole chip into question.
I second their conclusion, however, that bad blocks most often seem to
function just as well as good blocks after erasure. From time to time I've
had flash programmers screw up and mark a large portion of blocks as bad.
Erasing the whole chip, including factory-marked bad blocks, yields a
workable chip again, seemingly without any adverse effects, at least in a
lab environment. I would not try to sell a unit whose factory-marked
flash bad blocks had been erased, however, but it seems perfectly fine for
everyday use as a development platform.
I have yet to come across a chip with hard bad block errors; on the other
hand, I can't say that I've come across more than a handful of chips that
have had their initial bad blocks erased.
>> If you have a block that is wearing out, due to a large number of
>> erase/write cycles, it will exhibit several failure modes at an increasing
>> rate, in particular long-term data loss. At some point one could argue that
>> the block is essentially worn out, and mark it as bad, even though an
>> erase/write/verify cycle actually might be successful. I don't think that is
>> what is happening in your case though.
> That's all to do with what one considers a "Bad Block" - I'd agree
> that the repeated failures can show that there might be an issue but
> all the literature I've read today state that only permanent failures
> are regarded as showing a bad block and these are reported via the
> flashes status read. In fact I found a Micron Appnote AN1819
In the torture testing I mentioned above on the 256 Mb chips, one part of
the test involved erase/write/read-cycling four blocks over and over again
until an erase or write failure was reported by the chip. In this
particular case, 3.7 million cycles were performed before mtd reported an
'I/O error' (probably due to a chip timeout) on erase.
However, after this, I performed a complete reflash of the chip, which
didn't result in any errors being reported at all. A couple more erase/write
attempts showed that sometimes an error would be reported during write,
sometimes not, but never during erase (which is where the failure had
initially been reported).
The tortured block endured a mere 550 reads before the ECC 'correctable'
counter started increasing, and a mere 10000 reads before the ECC 'failed'
(2-bit errors) counter started increasing.
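For reference, the cycling part of that test boils down to something like
the sketch below. This is not the test harness I actually used; the device
node, block offset and data pattern are assumptions:

/* Sketch of an erase/write cycling loop of the sort described above:
 * cycle one eraseblock until the driver reports an erase or program
 * failure. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <mtd/mtd-user.h>

int main(void)
{
	int fd = open("/dev/mtd0", O_RDWR);   /* assumed device node */
	const unsigned int block = 0x20000;   /* assumed offset of the block under test */
	struct mtd_info_user info;
	unsigned char *page;

	if (fd < 0 || ioctl(fd, MEMGETINFO, &info) < 0) {
		perror("open/MEMGETINFO");
		return 1;
	}
	page = malloc(info.writesize);
	if (!page)
		return 1;
	memset(page, 0x55, info.writesize);   /* arbitrary test pattern */

	for (unsigned long cycle = 0; ; cycle++) {
		struct erase_info_user ei = { .start = block, .length = info.erasesize };

		if (ioctl(fd, MEMERASE, &ei) < 0) {
			printf("erase failed after %lu cycles\n", cycle);
			break;
		}
		int write_failed = 0;
		for (unsigned int off = 0; off < info.erasesize; off += info.writesize) {
			/* pwrite() fails when the chip reports a program error */
			if (pwrite(fd, page, info.writesize, block + off) !=
			    (ssize_t)info.writesize) {
				printf("write failed at 0x%x after %lu cycles\n",
				       block + off, cycle);
				write_failed = 1;
				break;
			}
		}
		if (write_failed)
			break;
	}
	free(page);
	close(fd);
	return 0;
}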
While this is a bit of an artificial test, being well over the specified
endurance, it does show a couple of things.
When an erase or write error is reported by the flash chip, the block in
question may already be well past its specified erase/write endurance. So
just using the chip's own failed-erase or failed-write indication is not
really sufficient or reliable. Indeed, as shown, after 3.7 million
erase/write cycles some writes still completed flawlessly, but read
reliability was very poor. Most likely read reliability would have been just
as bad after, say, 3.5 million erase/write cycles, before any write or erase
errors had been reported by the chip.
Which all goes to say that, as Cooke notes in the document above, it is
necessary for the software to keep track of the number of erase cycles,
and not just rely on the erase/write status that the chip itself reports.
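In code terms that amounts to keeping a per-block erase counter in software
and retiring the block once it reaches the rated endurance, whether or not
the chip has complained (UBI, for instance, keeps a per-eraseblock erase
counter for wear-levelling purposes). A trivial illustration; the threshold
and layout here are just assumptions for the sketch, not any existing
driver's policy:

/* Software-side endurance tracking: retire a block once its
 * (software-maintained) erase count reaches the rated endurance,
 * regardless of what the chip itself reports. */
#include <stdbool.h>
#include <stdint.h>

#define RATED_ENDURANCE 100000   /* erase/program cycles from the data sheet */

struct block_state {
	uint32_t erase_count;    /* bumped on every successful erase */
	bool     retired;        /* treated as bad even if the chip never complains */
};

/* Call after each successful erase of the block; returns true once the
 * block should be retired (remapped or marked bad in software). */
static bool note_erase(struct block_state *b)
{
	if (!b->retired && ++b->erase_count >= RATED_ENDURANCE)
		b->retired = true;
	return b->retired;
}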
> We're looking into these possibilities as well - However as is often
> the case, such problems provoke testing of less used code paths so
> it's quite a good thing to look at the right thing to do in this event
> in conjunction with fixing the root cause of the problem...
I agree.
Sorry for the long-winded reply. I'm not trying to prove I'm correct, but
I do want to share my experiences, and I value input from others, as there
doesn't seem to be a lot of hard information out there on these issues.
Besides, there seem to be a lot of misconceptions which I think can be
cleared up by discussion.
/Ricard
--
Ricard Wolf Wanderlöf ricardw(at)axis.com
Axis Communications AB, Lund, Sweden www.axis.com
Phone +46 46 272 2016 Fax +46 46 13 61 30