preventing multi-bit errors on NAND

Wed Oct 8 04:53:16 EDT 2008

Some of our NAND chips (128MB, 128k blocks, 2k pages) have multi-bit errors:
# cat /proc/nand
single-bit data errors : 1000
single-bit ecc errors  : 8
multi-bit errors       : 4
double multi-bit errors: 5

This causes some of our products (using these NAND chips) to fail horribly
(since data is lost).

Btw, we're using an old kernel/JFFS2/NAND version (linux-2.4.25, MTD CVS 2005).

I studied some JFFS2/NAND/MTD source code and wonder whether we could
have prevented this. I also looked at the latest JFFS2/NAND/MTD code
(kernel 2.6.26) and there aren't any major ECC/bad-block changes/improvements.
Now I have the following questions:

Why not use a 6-byte ECC code (per 256 bytes) to correct up to 2 bits ?
I know, this is not standard and would cause incompatibilities, still
I'd like to know whether it could be done or already has been done. There's
enough room in the OOB I believe.
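
A rough budget check for the 6-byte idea, as a sketch only. It assumes a
64-byte OOB per 2k page (usual for large-page parts, but check the datasheet)
and reserves 2 bytes for the bad-block marker. All names are made up for
illustration and none of this is taken from the MTD code.

```c
#include <assert.h>

/* 2k page is from the chip description above; the OOB size and the
 * bad-block marker reservation are assumptions. */
enum {
    NAND_PAGE_BYTES  = 2048,
    NAND_OOB_BYTES   = 64,   /* assumed */
    ECC_CHUNK_BYTES  = 256,
    BAD_MARKER_BYTES = 2     /* assumed */
};

/* OOB bytes needed when every 256-byte chunk gets ecc_per_chunk ECC bytes. */
static int ecc_bytes_per_page(int ecc_per_chunk)
{
    assert(ecc_per_chunk > 0);
    return (NAND_PAGE_BYTES / ECC_CHUNK_BYTES) * ecc_per_chunk;
}

/* OOB bytes left after ECC and bad-block marker; negative means no fit. */
static int oob_bytes_free(int ecc_per_chunk)
{
    return NAND_OOB_BYTES - BAD_MARKER_BYTES
           - ecc_bytes_per_page(ecc_per_chunk);
}
```

Under these assumptions, 3 bytes per chunk takes 24 bytes and leaves 38,
while 6 bytes per chunk takes 48 and leaves 14. JFFS2 also keeps its
cleanmarker in the OOB, so not all of the remainder is really free.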

Why not mark a block bad when detecting a single-bit error ?
I assume a multi-bit error was a single-bit error before.
A single-bit error is corrected and that's it. Nobody gets told about it,
let alone does JFFS2 act upon it.
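
To make the question concrete, here is a minimal sketch of such a policy:
count corrected single-bit errors per block and retire the block once a
threshold is reached. This is not what MTD or JFFS2 do. The struct, the
function and the threshold value are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS       1024  /* 128MB / 128k, from the geometry above */
#define RETIRE_THRESHOLD 3     /* hypothetical; 1 = retire on first error */

struct block_health {
    uint8_t corrected[NUM_BLOCKS];  /* corrected errors seen per block */
};

/*
 * Record one corrected single-bit error in `block`.
 * Returns true once the block has reached the threshold, i.e. the caller
 * should move the data elsewhere and then mark the block bad.
 */
static bool note_corrected_error(struct block_health *h, unsigned block)
{
    assert(block < NUM_BLOCKS);
    if (h->corrected[block] < UINT8_MAX)   /* saturate instead of wrapping */
        h->corrected[block]++;
    return h->corrected[block] >= RETIRE_THRESHOLD;
}
```

Setting RETIRE_THRESHOLD to 1 gives exactly the "mark bad on the first
single-bit error" behaviour. A higher value is a guess at a compromise,
since not every corrected bit flip has to mean permanent damage.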

Would #defining CONFIG_MTD_NAND_VERIFY_WRITE have helped/prevented this ?
Currently CONFIG_MTD_NAND_VERIFY_WRITE isn't #defined. It would probably
be better to actually #define it. It looks like a failed verification doesn't
lead to the block being marked bad, why not ?
I guess that if a verification fails mtdblock will use another block
to write the data, is this correct ?
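
The behaviour I have in mind (verify after write, mark the block bad on
a mismatch, retry in another block) can be sketched against a simulated
flash. Everything here is hypothetical, including sim_flash, sim_write and
write_verified. It is not how nand_base.c or mtdblock work, only the policy
being asked about.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SIM_BLOCKS    4
#define SIM_PAGE_SIZE 16

/* Simulated flash: one page per block, with a flag to force a bad cell. */
struct sim_flash {
    unsigned char data[SIM_BLOCKS][SIM_PAGE_SIZE];
    bool faulty[SIM_BLOCKS];  /* writes to this block come back corrupted */
    bool bad[SIM_BLOCKS];     /* block has been marked bad */
};

static void sim_write(struct sim_flash *f, int blk, const unsigned char *buf)
{
    memcpy(f->data[blk], buf, SIM_PAGE_SIZE);
    if (f->faulty[blk])
        f->data[blk][0] ^= 0x01;  /* flip one bit to model a failing cell */
}

/*
 * Write buf, read it back and compare. On a mismatch, mark the block bad
 * and try the next good block. Returns the block used, or -1 if none left.
 */
static int write_verified(struct sim_flash *f, const unsigned char *buf)
{
    int blk;

    assert(f != NULL && buf != NULL);
    for (blk = 0; blk < SIM_BLOCKS; blk++) {
        if (f->bad[blk])
            continue;
        sim_write(f, blk, buf);
        if (memcmp(f->data[blk], buf, SIM_PAGE_SIZE) == 0)
            return blk;           /* verification passed */
        f->bad[blk] = true;       /* verification failed: retire block */
    }
    return -1;
}
```

With block 0 flagged as faulty, write_verified() marks it bad and returns 1,
so the data ends up in the next good block.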
