Distinguishing bitflips due to read-disturb or due to wear-out

Sun Mar 18 16:56:45 EDT 2012

On Sun, 18 Mar 2012, Shmulik Ladkani wrote:

>> The bigger issue is how to discern whether the degradation is due to
>> read-disturb (which can be recovered by erasing/reprogramming the block)
>> or the page physically wearing out (in which case it needs to be
>> retired).
>
> Question is, do we really need to distinguish between the two?
>
> If there is a "dangerously high" number of bit errors, then scrubbing
> should be performed.
> If the reason for the bit errors was due to read-disturb, then those
> errors are gone after scrubbing (for now, until read-disturb affects
> again).
> If the reason was wear-out, then it is likely that high number of bit
> errors will be evident, again. But if the block is totally worn-out,
> shouldn't the device return an error status for the erase operation,
> eventually? (and as such, the MTD software will retire the block)?

Just my $0.02 worth ... in my experience, what happens as a block wears 
out is that the frequency / probability of read disturb errors increases, 
but there are not necessarily bit cells which have failed completely.

I once ran a test on a single 32 Mbyte NAND flash device, torturing one
block over and over until the device reported a failed erase or write
operation. The specified endurance was 100000 erase/write cycles (quite
typical for SLC flash). In one particular case, the erase operation first
returned an error status after 2.3 million cycles. In this state, it took
about fifteen minutes after writing until a bit flipped during read, yet
reading the data back immediately after writing showed no errors. Also,
even after the erase operation had returned an error for the first time,
several more write/erase cycles completed with no errors.
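
For anyone wanting to repeat this kind of test, the loop is roughly the
following. This is only a sketch: the device interface (erase, program,
read returning a success flag or data) and the FakeNand stand-in are
made up for illustration, and are not the MTD API or what I actually ran.

```python
def torture_block(dev, block, pattern, max_cycles):
    """Erase/program one block until the device reports a failure.

    Returns (cycle, mismatches): the cycle at which erase or program
    first reported an error (None if max_cycles was reached), and how
    many immediate read-backs differed from the written pattern.
    """
    mismatches = 0
    for cycle in range(1, max_cycles + 1):
        if not dev.erase(block):
            return cycle, mismatches
        if not dev.program(block, pattern):
            return cycle, mismatches
        # Immediate read-back; on its own this says little about wear.
        if dev.read(block) != pattern:
            mismatches += 1
    return None, mismatches


class FakeNand:
    """Hypothetical in-memory device; erase reports failure from a fixed count."""

    def __init__(self, fail_at):
        self.fail_at = fail_at
        self.erases = 0
        self.data = {}

    def erase(self, block):
        self.erases += 1
        self.data[block] = None
        return self.erases < self.fail_at

    def program(self, block, data):
        self.data[block] = data
        return True

    def read(self, block):
        return self.data[block]
```

Note that the loop stops on the status reported by the device, not on the
read-back result, which matches how the test above was terminated.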

From this it can be concluded that 'wearing out' is gradual and not a
sudden transition into a 'failed' state. Also, by the time the chip begins
reporting errors during erase and write, the blocks may be long past
their erase cycle specification. Finally, 'read disturb' errors are
probably of the same nature whether the blocks are new or heavily used;
it is just the error probability that increases.

Since it is not really possible to determine how worn a block is by
reading back the data immediately after a write, it would seem that any
mechanism that attempts to track how bad a block is should do one of two
things. Either monitor the erase cycle count for each block and retire
the block once a certain number of cycles has been reached (say, per the
specifications), or keep some sort of statistic on how old the data is,
i.e. for each read, how long ago the data was written and how many read
cycles have been performed since then, in order to judge how serious the
'read disturb' errors (bitflips) are.
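
To make the bookkeeping concrete, here is a minimal sketch that combines
the two options described above in one per-block record. Everything in
it is an assumption for illustration: the names, the read and age limits,
and the "fresh data" thresholds are invented, and only the 100000 cycle
figure comes from the datasheet value mentioned earlier.

```python
from dataclasses import dataclass

# Hypothetical limits, chosen only for illustration.
MAX_ERASE_CYCLES = 100_000        # endurance figure from the specification
MAX_READS_SINCE_WRITE = 50_000    # assumed read-disturb budget
MAX_DATA_AGE_S = 30 * 24 * 3600   # assumed age limit before scrubbing
FRESH_AGE_S = 15 * 60             # assumed: data this young should be clean
FRESH_READS = 100                 # assumed: so few reads should not disturb


@dataclass
class BlockStats:
    erase_count: int = 0
    written_at: float = 0.0       # time of the last write, in seconds
    reads_since_write: int = 0

    def on_erase(self):
        self.erase_count += 1
        self.reads_since_write = 0

    def on_write(self, now):
        self.written_at = now
        self.reads_since_write = 0

    def on_read(self):
        self.reads_since_write += 1


def classify(stats, now):
    """Return 'retire', 'scrub' or 'ok' for a block."""
    if stats.erase_count >= MAX_ERASE_CYCLES:
        return "retire"
    age = now - stats.written_at
    if stats.reads_since_write >= MAX_READS_SINCE_WRITE or age >= MAX_DATA_AGE_S:
        return "scrub"
    return "ok"


def bitflips_are_serious(stats, now):
    """Bitflips on young, rarely read data hint at wear rather than disturb."""
    age = now - stats.written_at
    return age < FRESH_AGE_S and stats.reads_since_write < FRESH_READS
```

The idea behind bitflips_are_serious() is that a flip seen in old,
heavily read data is what read disturb would produce anyway, whereas a
flip in data written a few minutes ago and barely read suggests the
block itself is getting tired.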

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30