Questions about NAND (double)bit errors

Wed Feb 8 17:26:10 EST 2006

On Friday 03 February 2006 00:12, Wolfgang Mües wrote:
> Hello,
>
> I want to use JFFS2/MTD in an embedded Linux device with frequent
> writes (worst case is 15 KBytes per 10 seconds, typical case is less than
> 10% of the worst case). The device will be a 512 MBit NAND SLC type from
> Hynix, Samsung or STM. We have a working prototype, and we have read many
> NAND flash papers available on the net, and the recent MTD mailing list
> archives.
>
> Besides wear leveling questions, there are program disturb errors
> (programming a page flips a bit in another page) and read disturb errors
> (reading a page flips a bit). Rates for these single-bit-errors are
> available in publications from M-systems and Toshiba.
>
> But since single bit errors are easily corrected by ECC, I am more
> interested in errors where more than 1 bit is flipped in a 256 byte ECC
> area. We cannot calculate these error numbers from the single bit errors
> because we don't know if these errors are unrelated to each other.

If you have not already done so, read the Toshiba NAND flash application 
guide, which might give some further info:
http://www.dataio.com/pdf/NAND/Toshiba/NandDesignGuide.pdf.pdf
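
As a rough baseline for the multi-bit question quoted above, the sketch below 
computes the chance of two or more flipped bits in one 256 byte ECC area under 
the assumption that bit flips are independent. That independence is exactly 
what the question says is unknown, so treat the result as a reference point 
only. The function name and the bit error probability are hypothetical 
placeholders, not measured values or figures from any datasheet.

```python
from math import comb

# 256 byte ECC area from the question; the ECC bytes themselves are ignored.
ECC_AREA_BITS = 256 * 8


def p_multi_bit(bit_error_prob, n_bits=ECC_AREA_BITS, max_terms=20):
    """Probability of 2 or more flipped bits among n_bits.

    Assumes every bit flips independently with probability bit_error_prob.
    The binomial tail is truncated after max_terms terms, which is adequate
    only when n_bits * bit_error_prob is much smaller than 1.
    """
    upper = min(n_bits, 2 + max_terms - 1)
    return sum(
        comb(n_bits, k) * bit_error_prob**k * (1 - bit_error_prob) ** (n_bits - k)
        for k in range(2, upper + 1)
    )


# Placeholder value for illustration only, NOT a measured error rate.
HYPOTHETICAL_BIT_ERROR_PROB = 1e-9

if __name__ == "__main__":
    p = p_multi_bit(HYPOTHETICAL_BIT_ERROR_PROB)
    print(f"P(>=2 flips per {ECC_AREA_BITS} bits) = {p:.3e}")
```

If real errors are correlated (for example, clustered in a block that is going 
bad), the true multi-bit rate could be higher than this independent-error 
figure.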

>
> Is there any information available to estimate/calculate the remaining
> errors after ECC correction? Or is there any information about first hand
> experience of NAND stress tests or other real world experience?
>
> Maybe the NAND project is terminated if I don't find anything about
> practical reliability...

I have not used JFFS2, but I have done extensive testing with YAFFS. At the 
NAND level they should be about the same.

I have done a few accelerated lifetime tests that have gone very well. In one 
test (run once on 512byte page devices and once on 2k page devices) I wrote, 
read back and verified over 120Gbytes of data to the fs without a single bit 
getting lost. Other people did similar tests too. This was on non-Linux 
devices, but that's not material at the NAND level.
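
To put the test volume in perspective, here is a back-of-the-envelope sketch 
comparing it with the write rate from the question (15 KBytes per 10 seconds 
worst case, typical case under 10% of that). It assumes decimal units and 
counts only user data, so any extra writes from garbage collection in the file 
system are not included. The function names are made up for illustration.

```python
# Assumption: decimal units (1 KByte = 1000 bytes, 1 GByte = 10**9 bytes).
WORST_CASE_BYTES = 15 * 1000       # 15 KBytes per interval, from the question
INTERVAL_SECONDS = 10
TEST_VOLUME_BYTES = 120 * 10**9    # "over 120Gbytes" written in the test
TYPICAL_FRACTION = 0.10            # "less than 10% of the worst case"

SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_day(bytes_per_interval, interval_seconds):
    """User data written per day at a steady rate."""
    return bytes_per_interval * SECONDS_PER_DAY / interval_seconds


def days_to_write(total_bytes, daily_bytes):
    """Days needed to write total_bytes at daily_bytes per day."""
    return total_bytes / daily_bytes


worst_daily = bytes_per_day(WORST_CASE_BYTES, INTERVAL_SECONDS)
typical_daily = worst_daily * TYPICAL_FRACTION

if __name__ == "__main__":
    print(f"worst case:   {worst_daily / 1e6:.1f} MBytes/day, "
          f"{days_to_write(TEST_VOLUME_BYTES, worst_daily):.0f} days to reach test volume")
    print(f"typical case: {typical_daily / 1e6:.1f} MBytes/day, "
          f"{days_to_write(TEST_VOLUME_BYTES, typical_daily):.0f} days to reach test volume")
```

Under these assumptions the worst case comes to about 130 MBytes per day, so 
the 120 GByte test corresponds to roughly two and a half years of continuous 
worst-case traffic.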

From my observations NAND is very reliable and is getting more reliable all 
the time.

There are at least two factors that might be different for JFFS2 vs YAFFS:
* Most flash reliability is specified based on an assumption that you perform 
a maximum number of writes per page. I don't know what JFFS2 does, but YAFFS 
does one major write and then writes a single byte deletion marker to the OOB 
area when the page is discarded. YAFFS2 does not write deletion markers. This 
is generally well within the write limits used for the specification, so the 
flash should be less stressed than was used to derive the specs. JFFS2 might 
be different here.
* YAFFS is very conservative in dealing with ECC failures. YAFFS retires a 
block if one ECC failure is seen. JFFS2, IIRC, allows five or so failures 
before retiring a block. The Toshiba folk have told me that if a block is 
going bad, it is most likely to start displaying recoverable 1-bit errors 
before displaying non-recoverable multi-bit errors. Thus, YAFFS will 
potentially perform differently in this area.
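
The difference in block retirement described in the second point can be 
sketched as a simple threshold policy. This is an illustration only, not code 
from YAFFS or JFFS2: the class and method names are hypothetical, and the 
threshold of five is just the "five or so" recollection above, not a verified 
JFFS2 constant.

```python
from dataclasses import dataclass, field


@dataclass
class RetirementPolicy:
    """Retire a block once it has shown max_ecc_failures ECC failures."""
    max_ecc_failures: int
    failures: dict = field(default_factory=dict)   # block number -> failure count
    retired: set = field(default_factory=set)      # block numbers taken out of use

    def record_ecc_failure(self, block):
        """Count one ECC failure; return True if the block is now retired."""
        if block in self.retired:
            return True
        self.failures[block] = self.failures.get(block, 0) + 1
        if self.failures[block] >= self.max_ecc_failures:
            self.retired.add(block)
        return block in self.retired


# Hypothetical thresholds modelled on the description above.
conservative = RetirementPolicy(max_ecc_failures=1)   # retire on first failure
tolerant = RetirementPolicy(max_ecc_failures=5)       # retire after "five or so"

if __name__ == "__main__":
    for n in range(1, 6):
        print(n, conservative.record_ecc_failure(7), tolerant.record_ecc_failure(7))
```

If a failing block really does show correctable 1-bit errors first, the 
conservative policy stops using it before a multi-bit error has much chance 
to appear, at the cost of giving up blocks earlier.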

Still, I think those reliability differences, at the flash level, are more than 
likely theoretical noise and are unlikely to be material in the real world.

One important factor, IMHO, is how you handle the write protect pin on the 
NAND. Some people tie the WP to the power supply failure flag. IMHO this is a 
bad thing to do since it can cause incomplete writes to happen if the WP is 
asserted during a write or erase cycle.

-- Charles