Ricard Wanderlof ricard.wanderlof at
Tue Feb 15 10:01:58 EST 2011

On Tue, 15 Feb 2011, David Peverley wrote:

> That's a good analysis; I've dug around as a result and found an
> interesting tech note from Micron at :

Yes, that's a good note on failure mechanisms, and one of the few around.

> They classify NAND failures into various types: Permanent Failures
> and Temporary Failures, with the temporary failures being split into
> "Program Disturb", "Read disturb", "Over-programming" and "Data loss"
> ...
> From the description of read disturb, it occurs due to many reads
> (hundreds of thousands or millions) prior to an erase. Currently my
> testing is using nandtestc.c and mtd_stresstest.ko - the former tests
> one cycle before re-programming and the latter is random but not
> expected to be more than tens of reads before a re-programme becomes
> statistically likely.

I agree, it doesn't sound like that is your problem.

> Potentially program disturb sounds like it _could_ be the behaviour I 
> observe but it's not clear.

It seems to fit in with the description, i.e. you get bits programmed that 
were not intended to be programmed. It seems to get worse when not 
programming whole pages at once (partial page programming).
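
To check whether verify failures point in that direction, one could count
which way the wrong bits flipped. This is only a rough sketch: it assumes
the usual NAND convention that erased cells read as 1 and programming
drives them to 0, so a 1->0 flip is a bit that got programmed without being
asked to. The names flip_counts and count_flips are made up for this
example and are not part of any MTD API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: how many bits flipped in each direction. */
struct flip_counts {
    unsigned one_to_zero;  /* written as 1, read back as 0 (unintended program) */
    unsigned zero_to_one;  /* written as 0, read back as 1 (bit did not stick) */
};

/* Count the set bits in one byte. */
static unsigned popcount8(uint8_t v)
{
    unsigned n = 0;

    while (v) {
        v &= (uint8_t)(v - 1);  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Compare what was written with a raw (non-ECC) read-back of the same area. */
struct flip_counts count_flips(const uint8_t *expected,
                               const uint8_t *readback, size_t len)
{
    struct flip_counts fc = { 0, 0 };
    size_t i;

    for (i = 0; i < len; i++) {
        uint8_t diff = expected[i] ^ readback[i];

        fc.one_to_zero += popcount8(diff & expected[i]);
        fc.zero_to_one += popcount8(diff & readback[i]);
    }
    return fc;
}
```

If nearly all the errors show up in one_to_zero, that would at least be
consistent with the program disturb description; it does not prove it.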

> My general take on this is that only the permanent type failures i.e. 
> those involving permanently stuck bits, require marking as bad blocks. 
> The recovery recommended for the other scenarios is always to erase and 
> re-programme. This potentially opens up a whole can of worms... My 
> interpretation of this is that if we verify a write and we've had a 
> (correctable and non-permanent) single bit error the Right Thing To Do 
> would be to erase and re-programme the block, probably with a very small 
> retry limit. We could argue that it's the responsibility of the 
> file-system to do this but programmatically I think nand_write_page() is 
> best placed to be able to do this.

Either the file system or some flash management layer such as UBI should 
take care of this, but I know that jffs2 doesn't, at any rate. I'd agree it 
makes sense for the lower level to guarantee, to the best of its 
capability, that the data written actually does get written to the flash.
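
The erase and re-programme idea with a small retry limit could look roughly
like the sketch below. It is not how nand_write_page() works today. It
operates on a whole block, since erase is block-granular, and so sidesteps
the question of what to do with other pages already in the block. The
flash_ops callbacks, the WV_* codes and the sim_* in-memory block are all
invented for this example so that the loop can be exercised without
hardware.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical low-level operations on one block; 0 means success. */
struct flash_ops {
    int (*erase_block)(void *ctx);
    int (*program_block)(void *ctx, const uint8_t *data, size_t len);
    int (*read_block)(void *ctx, uint8_t *buf, size_t len);
};

enum { WV_OK = 0, WV_IO_ERROR = -1, WV_VERIFY_FAILED = -2 };

/* Erase, program and verify; repeat up to max_retries extra times. */
int write_verify_retry(const struct flash_ops *ops, void *ctx,
                       const uint8_t *data, uint8_t *scratch, size_t len,
                       int max_retries)
{
    int attempt;

    for (attempt = 0; attempt <= max_retries; attempt++) {
        if (ops->erase_block(ctx) != 0)
            return WV_IO_ERROR;
        if (ops->program_block(ctx, data, len) != 0)
            return WV_IO_ERROR;
        if (ops->read_block(ctx, scratch, len) != 0)
            return WV_IO_ERROR;
        if (memcmp(data, scratch, len) == 0)
            return WV_OK;           /* read-back matches, done */
    }
    return WV_VERIFY_FAILED;        /* caller decides what happens next */
}

/* In-memory stand-in for a block, with injectable program faults. */
struct sim_block {
    uint8_t cells[16];
    int bad_programs;   /* upcoming programs that clear one extra bit */
    int erase_count;
};

static int sim_erase(void *ctx)
{
    struct sim_block *b = ctx;

    memset(b->cells, 0xFF, sizeof b->cells);   /* erased cells read as 1 */
    b->erase_count++;
    return 0;
}

static int sim_program(void *ctx, const uint8_t *data, size_t len)
{
    struct sim_block *b = ctx;
    size_t i;

    if (len > sizeof b->cells)
        return -1;
    for (i = 0; i < len; i++)
        b->cells[i] &= data[i];                /* programming only clears bits */
    if (b->bad_programs > 0) {
        b->cells[0] &= (uint8_t)~0x01;         /* simulated disturbed bit */
        b->bad_programs--;
    }
    return 0;
}

static int sim_read(void *ctx, uint8_t *buf, size_t len)
{
    struct sim_block *b = ctx;

    if (len > sizeof b->cells)
        return -1;
    memcpy(buf, b->cells, len);
    return 0;
}

const struct flash_ops sim_ops = { sim_erase, sim_program, sim_read };
```

A block that still fails after the retries is where the policy question
starts: whether to mark it bad or hand the error up to the layer above.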

I think that the application note in some respects simplifies matters a 
bit. If you have a block that is wearing out, due to a large number of 
erase/write cycles, it will exhibit several failure modes at an increasing 
rate, in particular long-term data loss. At some point one could argue 
that the block is essentially worn out, and mark it as bad, even though an 
erase/write/verify cycle actually might be successful. I don't think that 
is what is happening in your case though.

> Certainly the verify failures we see here with a raw read are
> occasional (and not consistently the same blocks) and hence not
> indicative of stuck bits and generally after the block is re-written
> the read is correct. What do you reckon?

It would seem that if you have only occasional faults in arbitrary blocks 
it wouldn't be a wear problem; if the blame is with the flash I would 
agree it fits in with the 'Program Disturb' description. I must admit I've 
not come across this type of error myself, but that could be because of 
limited experience or that it occurs extremely infrequently in the types 
of flash that I've been exposed to so I've never noticed it.
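
One way to back up the "not stuck bits" reading would be to compare two
erase/write/read cycles of the same block and see whether any bit is wrong
both times. The sketch below assumes the same data was written in both
cycles and that both read-backs are raw reads; count_persistent_errors is a
name made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in one byte. */
static unsigned bits_set(uint8_t v)
{
    unsigned n = 0;

    while (v) {
        v &= (uint8_t)(v - 1);  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * Number of bit positions that differ from 'expected' in BOTH read-backs.
 * Errors seen in only one of the two cycles are not counted.
 */
size_t count_persistent_errors(const uint8_t *expected,
                               const uint8_t *read1,
                               const uint8_t *read2, size_t len)
{
    size_t total = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        uint8_t err1 = expected[i] ^ read1[i];
        uint8_t err2 = expected[i] ^ read2[i];

        total += bits_set(err1 & err2);
    }
    return total;
}
```

A result of zero over a number of cycles would fit with transient errors;
a bit that keeps coming back in the same place looks more like a permanent
failure.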

I don't know if anyone else here on this list has any experiences to 
share. Frankly, if I saw errors of that type I would start looking for 
hardware problems, or some sort of hardware or software induced contention 
on the flash chip bus. Not that that would necessarily be the right 
approach, but I've seen errors of that type occurring as the result of 
out-of-spec level shifters between the MCU and flash chip, or incorrectly 
set up bus timing towards the flash.

Ricard Wolf Wanderlöf                           ricardw(at)
Axis Communications AB, Lund, Sweden  
Phone +46 46 272 2016                           Fax +46 46 13 61 30
