Distinguishing bitflips due to read-disturb or due to wear-out
Ricard Wanderlof
ricard.wanderlof at axis.com
Sun Mar 18 16:56:45 EDT 2012
On Sun, 18 Mar 2012, Shmulik Ladkani wrote:
>> The bigger issue is how to discern whether the degradation is due to
>> read-disturb (which can be recovered by erasing/reprogramming the block)
>> or the page physically wearing out (in which case it needs to be
>> retired).
>
> Question is, do we really need to distinguish between the two?
>
> If there is a "dangerously high" number of bit errors, then scrubbing
> should be performed.
> If the reason for the bit errors was read-disturb, then those
> errors are gone after scrubbing (for now, until read-disturb strikes
> again).
> If the reason was wear-out, then it is likely that a high number of bit
> errors will be evident again. But if the block is totally worn out,
> shouldn't the device eventually return an error status for the erase
> operation (and as such, the MTD software will retire the block)?

Just my $0.02 worth ... in my experience, what happens as a block wears
out is that the frequency / probability of read disturb errors increases,
but there are not necessarily bit cells which have failed completely.

I once did a test on a single 32 Mbyte NAND flash device, torturing a
block over and over again until the device reported a failed erase or
write operation. The specified guaranteed maximum erase/write cycle count
was 100,000 (quite typical for SLC flash). In one particular case, the
erase operation first returned an error status after 2.3 million cycles
had been performed. In this state, it took about fifteen minutes after
writing until a bit flipped during read, yet reading back the data
immediately after writing showed no errors. Also, even after the erase
operation returned an error for the first time, several more write/erase
cycles were completed with no errors.
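
For illustration only, a torture loop of the kind described above might look
roughly like the sketch below. This is not the code used in the original
test. `SimulatedBlock`, `torture`, and the 50% failure probability are all
hypothetical stand-ins; a real test would drive actual hardware and would
also need delayed readback, since an immediate readback showed no errors.

```python
import random


class SimulatedBlock:
    """Toy stand-in for one NAND block (hypothetical, not a device model)."""

    def __init__(self, fail_after, seed=0):
        self.fail_after = fail_after      # cycle count where erase errors may begin
        self.cycles = 0                   # erase cycles performed so far
        self._rng = random.Random(seed)
        self._data = b""

    def erase(self):
        """Erase the block; return False if the 'device' reports an error."""
        self.cycles += 1
        # Assumption: past fail_after, an erase fails only some of the time,
        # so later cycles can still succeed, as observed in the test above.
        if self.cycles >= self.fail_after and self._rng.random() < 0.5:
            return False
        self._data = b""
        return True

    def write(self, data):
        self._data = data
        return True

    def read(self):
        return self._data


def torture(block, pattern, max_cycles):
    """Erase/write/read back repeatedly.

    Returns the cycle number of the first reported erase or write error,
    or of the first immediate readback mismatch, or None if max_cycles
    complete without any of these.
    """
    for cycle in range(1, max_cycles + 1):
        if not block.erase() or not block.write(pattern):
            return cycle
        if block.read() != pattern:
            return cycle
    return None
```

The returned cycle number can then be compared with the datasheet figure,
e.g. 2.3 million observed against 100,000 specified is a factor of 23.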

From this it can be concluded that wearing out is gradual and not a
sudden transition into a 'failed' state. Also, by the time the chip begins
reporting errors during erase and write, the blocks may be long past
their erase cycle specification. Finally, the 'read disturb' errors are
probably of the same nature regardless of whether the blocks are new or
have been used a lot; it's just the error probability that increases.

Since it is not really possible to determine how worn a block is by
reading back the data immediately after a write, it would seem that any
mechanism that attempts to track how bad the block is should either
monitor the erase cycle count for each block and retire the block after a
certain number of cycles has been reached (say, per the specifications),
or try to keep some sort of statistic on how old the data is, i.e. for
each read, how long ago the data was written and how many read cycles
have been performed since then, in order to determine how serious the
'read disturb' errors (bitflips) are.
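
As a minimal sketch of that bookkeeping, the following combines both
options in one per-block record. All names (`BlockStats`, `advise`) and
the age and read-count thresholds are hypothetical, chosen only for
illustration; only the 100,000-cycle figure comes from the datasheet
value mentioned earlier. Suitable limits would have to be determined per
device.

```python
from dataclasses import dataclass

# Hypothetical limits, for illustration only.
MAX_ERASE_CYCLES = 100_000           # e.g. the specified endurance
MAX_DATA_AGE_S = 30 * 24 * 3600      # assumed: data older than this is suspect
MAX_READS_SINCE_WRITE = 50_000       # assumed: read-disturb budget per write


@dataclass
class BlockStats:
    erase_count: int = 0             # total erase cycles seen by this block
    written_at: float = 0.0          # timestamp of the last write (seconds)
    reads_since_write: int = 0       # reads performed since the last write

    def on_erase(self):
        self.erase_count += 1
        self.reads_since_write = 0

    def on_write(self, now):
        self.written_at = now
        self.reads_since_write = 0

    def on_read(self):
        self.reads_since_write += 1


def advise(stats, now):
    """Return 'retire', 'scrub' or 'ok' for one block."""
    # Option 1: retire once the erase count reaches the chosen limit.
    if stats.erase_count >= MAX_ERASE_CYCLES:
        return "retire"
    # Option 2: judge read-disturb risk from data age and read count.
    age = now - stats.written_at
    if age >= MAX_DATA_AGE_S or stats.reads_since_write >= MAX_READS_SINCE_WRITE:
        return "scrub"
    return "ok"
```

Here 'scrub' means erase and reprogram the block, and 'retire' means stop
using it; the statistics themselves would need to be stored persistently
to survive a reboot.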
/Ricard
--
Ricard Wolf Wanderlöf ricardw(at)axis.com
Axis Communications AB, Lund, Sweden www.axis.com
Phone +46 46 272 2016 Fax +46 46 13 61 30