Is it an atomic operation for writing a page in NAND flash

Wed Jan 20 20:37:43 EST 2010

Very informative, it expanded my view. Thanks!

2010/1/21 Ricard Wanderlof <ricard.wanderlof at axis.com>:
>
> On Wed, 20 Jan 2010, Liu Hui wrote:
>
>> Thank you very much, CRC is the real solution.
>>
>> But I don't understand: if a partial write happens and we use ECC to
>> correct the data, we will find that the data can't be corrected, and
>> -EBADMSG will be returned (see nand_correct_data()), so we know that
>> this page is corrupted. IMHO, this works.
>
> Assuming the ECC algorithm used by mtd, it only produces correct results in
> the case of 0, 1 or 2 bit errors in the data. For more bit errors than that,
> the result is undefined.
>
> Let's assume a partial write occurs, which leads to 57 bit errors compared
> to what was originally supposed to be there. Since there are more than 2 bit
> errors, the algorithm output is undefined; it may say that the data can't be
> corrected, or it may say that the data is ok, or it may say that the data
> can be corrected; it's impossible to tell. As far as I understand, it is not
> uncommon for ECC to say the data is correct when in fact it isn't.
>
> A slightly trivial case:
>
> Again assuming the ECC algorithm used by mtd, the ECC bytes for a chunk of
> data where all the bytes have the same value is 0xFFFFFF, regardless of the
> actual value. So, say you have a page full of 0xA3; the ECC is then
> 0xFFFFFF. Now assume a partial write causes bit 2 of every byte to
> not change from 1 to 0 when programming. The result is a page full of 0xA7,
> in effect, 256 bit errors (assuming a page size of 256 bytes, or at least,
> assuming an ECC calculation encompassing that many bytes). But the ECC will
> still be 0xFFFFFF, and the corresponding ECC calculation will say that the
> data is correct. That is, as I mentioned before, because the result of an
> ECC calculation on data with >2 bit errors is undefined.
>
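
A minimal sketch of the case above, in Python. It assumes a simplified
256-byte Hamming code (one odd and one even parity bit per bit of the 11-bit
bit address, stored inverted). The bit layout is not the one produced by
mtd's nand_calculate_ecc(), and hamming_ecc() is a made-up name. zlib.crc32
stands in for the CRC mentioned earlier in the thread.

```python
import zlib

ADDR_BITS = 11                      # 256 bytes * 8 bits = 2048 bit addresses
ADDR_MASK = (1 << ADDR_BITS) - 1

def hamming_ecc(data: bytes) -> int:
    """Simplified 22-bit Hamming ECC over 256 bytes (illustrative layout only)."""
    assert len(data) == 256
    odd = 0    # bit k = parity of all set data bits whose address has bit k set
    even = 0   # bit k = parity of all set data bits whose address has bit k clear
    for byte_index, byte in enumerate(data):
        for bit in range(8):
            if (byte >> bit) & 1:
                addr = byte_index * 8 + bit
                odd ^= addr
                even ^= ~addr & ADDR_MASK
    parity = (odd << ADDR_BITS) | even
    # Stored inverted; the two unused top bits of the 3 ECC bytes read as 1.
    return ~parity & 0xFFFFFF

intended = bytes([0xA3]) * 256           # what should have been programmed
stuck = bytes([0xA7]) * 256              # bit 2 of every byte stayed at 1
one_flip = bytes([0xA2]) + intended[1:]  # a single flipped bit, for contrast

print(hex(hamming_ecc(intended)))        # 0xffffff
print(hex(hamming_ecc(stuck)))           # 0xffffff, 256 bit errors go unnoticed
print(hamming_ecc(one_flip) != hamming_ecc(intended))    # True
print(zlib.crc32(intended) != zlib.crc32(stuck))         # True
```

Every parity bit covers an even number of identical bytes, so the parities
cancel whatever the byte value is. The CRC has no such blind spot for this
particular pattern.
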
> Note that there are other ECC algorithms which can correct more error bits.
> For MLC flash it is recommended to use an algorithm which corrects 4 bit
> errors rather than a single bit error in a block of data. Such algorithms
> require more ECC bits though.
>
> One has a tendency to think of ECC as a checking algorithm. It is not. It
> corrects and detects bit errors under certain circumstances. Outside those
> circumstances it is worthless. For the case of the software algorithm used
> in mtd, it is worthless if there are more than 2 bit errors. A failed write
> could cause any number of bit errors, so it is worthless to check the result
> using the ECC algorithm. The normal failure mode of a nand flash chip is
> random single-bit errors with low probability. ECC handles this
> elegantly.
>
>
> Elaborating on this slightly, a devil's advocate would consider ECC
> worthless as a correction algorithm for a flash chip. Assume one bit error
> occurs, which is corrected by ECC. Then another bit error occurs in the same
> page. ECC then detects a failure. Then another bit error occurs. The ECC
> algorithm is now worthless, it may detect the error, it may say that the
> data is correct, or it may even try to correct it (erroneously).
>
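
The sequence above (one, two, then three or more bit errors) can be sketched
in Python. This assumes the same kind of simplified 256-byte Hamming code,
not mtd's actual bit layout, and ignores errors in the ECC bytes themselves.
parity(), flip() and decode() are hypothetical helper names.

```python
ADDR_BITS = 11                      # 256 bytes * 8 bits = 2048 bit addresses
ADDR_MASK = (1 << ADDR_BITS) - 1

def parity(data):
    """Return (odd, even) address parities over all set bits in data."""
    odd = even = 0
    for addr in range(len(data) * 8):
        if (data[addr // 8] >> (addr % 8)) & 1:
            odd ^= addr
            even ^= ~addr & ADDR_MASK
    return odd, even

def flip(data, addrs):
    """Return a copy of data with the bits at the given bit addresses flipped."""
    out = bytearray(data)
    for a in addrs:
        out[a // 8] ^= 1 << (a % 8)
    return bytes(out)

def decode(read, stored):
    """Classic single-error-correct, double-error-detect decision."""
    odd, even = parity(read)
    s_odd, s_even = odd ^ stored[0], even ^ stored[1]
    if s_odd == 0 and s_even == 0:
        return "ok", read
    if (s_odd ^ s_even) == ADDR_MASK:
        # Looks like one flipped bit at address s_odd: "correct" it.
        return "corrected", flip(read, [s_odd])
    return "uncorrectable", read

original = bytes(range(256))
stored = parity(original)

results = []
for addrs in ([5], [5, 9], [1, 2, 4], [1, 2, 4, 7]):
    status, data = decode(flip(original, addrs), stored)
    results.append((len(addrs), status, data == original))
    print(len(addrs), "bit error(s):", status, "| data intact:", data == original)
```

In this sketch the 3-bit case is reported as corrected although the decoder
has flipped a fourth, innocent bit, and the 4-bit case is reported as ok. Only
the 1-bit and 2-bit cases give a trustworthy answer.
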
> The reason that all this works in practice is that the probability of a bit
> error occurring is so low that the probability of two bit errors occurring
> in the same page is very low, in some respect lower than other failure modes
> in the system, so that we don't have to worry about it. (For example: What
> about bit errors occurring in RAM chips from cosmic radiation? It is a real
> risk, but so small that most systems don't have to worry about it.)
>
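
As rough arithmetic for the argument above, here is a Python sketch that
assumes independent bit errors and a purely hypothetical per-bit error
probability. P_BIT is an illustration value, not a figure for any real chip.

```python
from math import comb

def p_exactly(k, n_bits, p_bit):
    """Binomial probability of exactly k flipped bits among n_bits."""
    return comb(n_bits, k) * p_bit**k * (1 - p_bit)**(n_bits - k)

N_BITS = 256 * 8    # one 256-byte ECC chunk
P_BIT = 1e-8        # HYPOTHETICAL per-bit error probability

for k in (1, 2, 3):
    print(k, "error(s):", p_exactly(k, N_BITS, P_BIT))
```

Under these assumptions each additional error in the same chunk is less
likely by a factor on the order of N_BITS * P_BIT, which is why the 2-bit
limit is rarely reached as long as P_BIT stays small.
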
> It can be a real concern though, and that is why things like UBI provide
> so-called bit scrubbing: whenever it detects that ECC has done a bit
> correction in a block, it erases that block and rewrites the data [lots of
> details omitted here] so that the chance of two bit errors ever occurring in
> the same page will be very small indeed.
>
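
The scrubbing idea can be sketched with a toy in-memory model in Python.
ToyFlash and read_and_scrub() are hypothetical names and this is not UBI's
code; it assumes the data is rewritten to a spare block before the old block
is erased, and it leaves out everything the quoted text calls "details".

```python
class ToyFlash:
    """Hypothetical flash model: per-block page data plus a bit-flip counter."""

    def __init__(self, n_blocks):
        self.pages = {b: [] for b in range(n_blocks)}
        # Number of bits ECC had to correct when the block was last read.
        self.bitflips = {b: 0 for b in range(n_blocks)}

    def read_block(self, block):
        """Return (ECC-corrected page data, number of corrected bits)."""
        return list(self.pages[block]), self.bitflips[block]

    def write_block(self, block, pages):
        self.pages[block] = list(pages)

    def erase(self, block):
        self.pages[block] = []
        self.bitflips[block] = 0

def read_and_scrub(flash, block, spare_block):
    """Read a block; if ECC corrected anything, move the data to spare_block.

    Returns (data, block that now holds the data).
    """
    data, corrected = flash.read_block(block)
    if corrected == 0:
        return data, block
    flash.write_block(spare_block, data)  # write the clean copy first
    flash.erase(block)                    # the accumulated bit flip is gone
    return data, spare_block

flash = ToyFlash(2)
flash.write_block(0, [b"page0", b"page1"])
flash.bitflips[0] = 1                     # pretend one bit has gone bad
data, location = read_and_scrub(flash, 0, 1)
print(location, data)
```

The point of acting on the first corrected bit is that the data is rewritten
while ECC can still be trusted, before a second error in the same page pushes
it toward the undefined region.
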
> Especially in these days of larger and larger flash chips resulting from
> shrinking chip geometries this is a problem that is getting worse and worse.
> It also tends to vary hugely among manufacturers.
>
> /Ricard
> --
> Ricard Wolf Wanderlöf                           ricardw(at)axis.com
> Axis Communications AB, Lund, Sweden            www.axis.com
> Phone +46 46 272 2016                           Fax +46 46 13 61 30
>

-- 
Thanks & Best Regards
Liu Hui