CONFIG_MTD_NAND_VERIFY_WRITE with Software ECC
Ricard Wanderlof
ricard.wanderlof at axis.com
Thu Feb 17 05:04:57 EST 2011
On Tue, 15 Feb 2011, David Peverley wrote:
> I did some more Googling and found a couple more interesting articles
> ; I don't know if you've read them (I assume you probably have!) but
> thought I'd repost for anyone else interested as this seems to be a
> really fascinating and not often discussed topic! :
> http://www.broadbandreports.com/r0/download/1507743~59e7b9dda2c0e0a0f7ff119a7611c641/flash_mem_summit_jcooke_inconvenient_truths_nand.pdf
>
> I found the following also really interesting, especially for the
> analysis of lots of devices with a plot of numbers of initial bad
> blocks marked which I'd always wondered about! :
> http://trs-new.jpl.nasa.gov/dspace/bitstream/2014/40761/1/08-07.pdf
Thanks for the links. I have actually seen the first of these, but not the
second (which, however, seems in parts to draw heavily on the first one).
A couple of years ago I actually did some tests on a few 256 Mb SLC NAND
chips in order to better understand the failure modes. These were
specified as withstanding a maximum of 100 000 erase/program cycles.
Indeed, in one portion of the test, after 1.7 million erase/write cycles,
repeatedly reading the same block would result in uncorrectable ECC
failures (i.e. two-bit errors per 256 bytes) after 1.6 million read
cycles. This is, with some margin, in line with Cooke's rule of thumb that
at most 1 million read cycles should be performed before rewriting the
block (the block in question started exhibiting correctable (i.e. 1-bit)
errors after a mere 40 000 read cycles, but then again I was way over
spec in terms of erase/write cycles).
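For what it's worth, a read-disturb check of this kind can be done entirely
from user space by hammering one eraseblock with reads and watching the MTD
ECC counters via the ECCGETSTATS ioctl. The following is only a minimal
sketch along those lines, not my actual test code; the device node, block
offset and iteration count are made-up placeholders:

/* Minimal read-disturb sketch: repeatedly read one eraseblock through
 * /dev/mtdX and watch the MTD ECC counters.  The device node, offset
 * and iteration count below are placeholders, not values from my test. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <mtd/mtd-user.h>

int main(void)
{
	const char *dev = "/dev/mtd0";     /* assumed device node */
	const off_t block = 0x20000;       /* assumed offset of the block under test */
	unsigned char buf[2048];           /* one page; adjust to the chip's page size */
	struct mtd_ecc_stats before, after;
	int fd = open(dev, O_RDONLY);

	if (fd < 0 || ioctl(fd, ECCGETSTATS, &before) < 0) {
		perror("open/ECCGETSTATS");
		return 1;
	}

	for (long i = 0; i < 2000000; i++) {
		/* Depending on kernel version, an uncorrectable ECC error may
		 * also be reported here as an error from pread(); corrected
		 * bitflips only show up in the counters. */
		if (pread(fd, buf, sizeof(buf), block) < 0)
			perror("pread");

		if (ioctl(fd, ECCGETSTATS, &after) == 0 &&
		    (after.corrected != before.corrected ||
		     after.failed != before.failed)) {
			printf("read %ld: corrected=%u failed=%u\n",
			       i, after.corrected, after.failed);
			before = after;
		}
	}
	close(fd);
	return 0;
}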
One part of NASA's analysis confused me. They attempt to erase factory-marked
bad blocks and include them in the rest of the tests in order to
compare the characteristics of the known bad blocks with those of the rest
of the memory array.
However, data sheets often state that one should not attempt to write or
erase bad blocks (e.g. the Samsung K9F1G08X0B data sheet simply says 'Do
not erase or program factory-marked bad blocks.'), and I've read
application notes explaining that bad blocks may have failures that can
cause unreliable operation of the rest of the chip if an attempt is made to
use them. That is, bad blocks do not necessarily just contain bad bit cells;
there could be some other failure in the die that causes irregular
operation when, for instance, the internal erase voltages are applied to the
block.
Therefore, erasing bad blocks and including them in the test is basically
violating the device specification to start with.
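For completeness, honouring the factory marks from user space looks roughly
like the sketch below: check the bad block marker (MEMGETBADBLOCK) before
issuing MEMERASE, and leave marked blocks alone. This is only an
illustration; the device node is an assumption:

/* Sketch: erase an MTD partition while honouring the factory bad block
 * marks, i.e. skip any block that MEMGETBADBLOCK reports as bad.
 * Illustration only; /dev/mtd0 is an assumed device node. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <mtd/mtd-user.h>

int main(void)
{
	int fd = open("/dev/mtd0", O_RDWR);
	struct mtd_info_user info;

	if (fd < 0 || ioctl(fd, MEMGETINFO, &info) < 0) {
		perror("open/MEMGETINFO");
		return 1;
	}

	for (long long off = 0; off < info.size; off += info.erasesize) {
		/* MEMGETBADBLOCK returns > 0 for a block carrying a bad mark */
		if (ioctl(fd, MEMGETBADBLOCK, &off) > 0) {
			printf("skipping bad block at 0x%llx\n",
			       (unsigned long long)off);
			continue;
		}
		struct erase_info_user ei = {
			.start  = (__u32)off,
			.length = info.erasesize,
		};
		if (ioctl(fd, MEMERASE, &ei) < 0)
			perror("MEMERASE");
	}
	close(fd);
	return 0;
}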
The fact that the factory bad-block screening cannot be replicated by simple
testing is also well documented by flash manufacturers, either in data sheets
or application notes. I remember reading that in order to determine whether a
block is bad, a number of tests are run at the factory under much more
severe conditions than the average use case. I don't know whether this just
involves testing at abnormal voltages and temperatures, or whether the die is
also examined visually (or in some other non-electrical way) for defects
prior to being packaged, or whether undocumented chip access modes are used
during the tests.
Indeed, ST's application note AN1819, which you refer to (as sourced
from Micron; I think this is simply because Micron bought out Numonyx,
which had taken over ST's flash division, or however the exact economic
transactions went), mentions that "The invalid [i.e. bad] blocks are
detected at the factory during the testing process which involves severe
environmental conditions and program/erase cycles as well as proprietary
test modes. The failures that affect invalid blocks may not all be
recognized if methods different from those implemented in the factory are
used."
So as far as I'm concerned, it is well documented that it is not possible
for the user to determine, by testing, which blocks were marked bad at the
factory. Indeed, erasing or writing bad blocks could put the integrity of the
whole chip into question.
I second their conclusion, however, that bad blocks most often seem to
function just as well as good blocks after erasure. From time to time I've
had flash programmers screw up and mark a large portion of blocks as bad.
Erasing the whole chip, including factory-marked bad blocks, yields a
workable chip again, seemingly without any adverse effects, at least in a
lab environment. I would not try to sell a unit whose factory-marked
flash bad blocks had been erased, however, but it seems perfectly fine for
everyday use as a development platform.
I have yet to come across a chip with hard bad block errors; on the other
hand, I can't say that I've come across more than a handful of chips that
have had their initial bad blocks erased.
>> If you have a block that is wearing out, due to a large number of
>> erase/write cycles, it will exhibit several failure modes at an increasing
>> rate, in particular long-term data loss. At some point one could argue that
>> the block is essentially worn out, and mark it as bad, even though an
>> erase/write/verify cycle actually might be successful. I don't think that is
>> what is happening in your case though.
> That's all to do with what one considers a "Bad Block" - I'd agree
> that the repeated failures can show that there might be an issue but
> all the literature I've read today state that only permanent failures
> are regarded as showing a bad block and these are reported via the
> flashes status read. In fact I found a Micron Appnote AN1819
In the torture testing I mentioned above on the 256 Mb chips, one part of
the test involved erase/write/read-cycling four blocks over and over again
until an erase or write failure was reported by the chip. In this
particular case, 3.7 million cycles were performed before mtd reported an
'I/O error' (probably due to a chip timeout) on erase.
However, after this, I performed a complete reflash of the chip, which
didn't result in any errors being reported at all. A couple more erase/write
attempts showed that sometimes an error would be reported during write,
sometimes not, but never during erase (which is where the failure had
initially been reported).
The tortured block endured a mere 550 reads before the ECC 'correctable'
counter started increasing, and a mere 10000 reads before the ECC 'failed'
(2-bit errors) counter started increasing.
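For reference, the cycling part of that test boils down to something like
the sketch below. This is not the test harness I actually used; the device
node, block offset and data pattern are assumptions:

/* Sketch of an erase/write cycling loop of the sort described above:
 * cycle one eraseblock until the driver reports an erase or program
 * failure. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <mtd/mtd-user.h>

int main(void)
{
	int fd = open("/dev/mtd0", O_RDWR);   /* assumed device node */
	const unsigned int block = 0x20000;   /* assumed offset of the block under test */
	struct mtd_info_user info;
	unsigned char *page;

	if (fd < 0 || ioctl(fd, MEMGETINFO, &info) < 0) {
		perror("open/MEMGETINFO");
		return 1;
	}
	page = malloc(info.writesize);
	if (!page)
		return 1;
	memset(page, 0x55, info.writesize);   /* arbitrary test pattern */

	for (unsigned long cycle = 0; ; cycle++) {
		struct erase_info_user ei = { .start = block, .length = info.erasesize };

		if (ioctl(fd, MEMERASE, &ei) < 0) {
			printf("erase failed after %lu cycles\n", cycle);
			break;
		}
		int write_failed = 0;
		for (unsigned int off = 0; off < info.erasesize; off += info.writesize) {
			/* pwrite() fails when the chip reports a program error */
			if (pwrite(fd, page, info.writesize, block + off) !=
			    (ssize_t)info.writesize) {
				printf("write failed at 0x%x after %lu cycles\n",
				       block + off, cycle);
				write_failed = 1;
				break;
			}
		}
		if (write_failed)
			break;
	}
	free(page);
	close(fd);
	return 0;
}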
While this is a bit of an artificial test, being well over the specified
endurance, it does show a couple of things.
When an erase or write error is reported by the flash chip, the block in
question may already be well past its specified erase/write endurance. So
just using the chip's own failed-erase or failed-write indication is not
really sufficient or reliable. Indeed, as shown, after 3.7 million
erase/write cycles some writes still completed flawlessly, but read
reliability was very poor. Most likely read reliability would have been just
as bad after, say, 3.5 million erase/write cycles, before any write or erase
errors had been reported by the chip.
Which all goes to say that, as Cooke notes in the document above, it is
necessary for the software to keep track of the number of erase cycles,
and not just rely on the erase/write status that the chip itself reports.
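In code terms that amounts to keeping a per-block erase counter in software
and retiring the block once it reaches the rated endurance, whether or not
the chip has complained (UBI, for instance, keeps a per-eraseblock erase
counter for wear-levelling purposes). A trivial illustration; the threshold
and layout here are just assumptions for the sketch, not any existing
driver's policy:

/* Software-side endurance tracking: retire a block once its
 * (software-maintained) erase count reaches the rated endurance,
 * regardless of what the chip itself reports. */
#include <stdbool.h>
#include <stdint.h>

#define RATED_ENDURANCE 100000   /* erase/program cycles from the data sheet */

struct block_state {
	uint32_t erase_count;    /* bumped on every successful erase */
	bool     retired;        /* treated as bad even if the chip never complains */
};

/* Call after each successful erase of the block; returns true once the
 * block should be retired (remapped or marked bad in software). */
static bool note_erase(struct block_state *b)
{
	if (!b->retired && ++b->erase_count >= RATED_ENDURANCE)
		b->retired = true;
	return b->retired;
}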
> We're looking into these possibilities as well - However as is often
> the case, such problems provoke testing of less used code paths so
> it's quite a good thing to look at the right thing to do in this event
> in conjunction with fixing the root cause of the problem...
I agree.
Sorry for the long-winded reply. I'm not trying to prove I'm correct, but
I do want to share my experiences, and I value input from others, as there
doesn't seem to be a lot of hard information out there on these issues.
Besides, there seem to be a lot of misconceptions which I think can be
cleared up by discussion.
/Ricard
--
Ricard Wolf Wanderlöf ricardw(at)axis.com
Axis Communications AB, Lund, Sweden www.axis.com
Phone +46 46 272 2016 Fax +46 46 13 61 30