preventing multi-bit errors on NAND

Ricard Wanderlof ricard.wanderlof at axis.com
Wed Oct 8 06:54:15 EDT 2008


On Wed, 8 Oct 2008, Norbert van Bolhuis wrote:

>
> Some of our NAND chips (128MB, 128k blocks, 2k pages) have multi-bit errors
> ...
> I studied some JFFS2/NAND/MTD source code and wonder whether we could
> have prevented this. I also looked at the latest JFFS2/NAND/MTD code
> (kernel 2.6.26) and there aren't any major ECC/bad-block changes/improvements.

I think UBI provides at least some of the functionality you are looking 
for (see the MTD homepage).

Otherwise, NAND flash reliability seems to vary considerably between 
manufacturers in this respect, regardless of the similarities when it 
comes to specifications.

> Why not use a 6-byte ECC code (per 256 bytes) to correct at most 2 bits?
> I know, this is not standard and would cause incompatibilities, still
> I'd like to know whether it could be done or already has been done. There's
> enough room in the OOB I believe.

That would certainly help.
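
For what it's worth, the OOB arithmetic does seem to work out: a 2k page 
has a 64-byte OOB area, and 8 subpages x 6 bytes = 48 ECC bytes, which 
still leaves the factory bad block marker and a few spare bytes 
untouched. A purely hypothetical layout (not mainline code, and it only 
covers the OOB side -- the 2-bit-correcting algorithm itself would have 
to come from somewhere, e.g. a Reed-Solomon or BCH code rather than the 
standard 1-bit Hamming code in nand_ecc.c) could look like this:

#include <linux/mtd/nand.h>

/*
 * Hypothetical OOB layout for 2k pages / 64-byte OOB, reserving
 * 6 ECC bytes per 256-byte subpage (8 x 6 = 48 bytes). Bytes 0-1
 * stay free for the factory bad block marker, bytes 2-15 remain
 * available for other users (e.g. a cleanmarker).
 */
static struct nand_ecclayout nand_oob_64_6byte = {
	.eccbytes = 48,
	.eccpos = {
		16, 17, 18, 19, 20, 21, 22, 23,
		24, 25, 26, 27, 28, 29, 30, 31,
		32, 33, 34, 35, 36, 37, 38, 39,
		40, 41, 42, 43, 44, 45, 46, 47,
		48, 49, 50, 51, 52, 53, 54, 55,
		56, 57, 58, 59, 60, 61, 62, 63 },
	.oobfree = { { .offset = 2, .length = 14 } },
};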

> Why not mark a block bad when detecting a single-bit error ?
> I assume a multi-bit error was a single-bit error before.
> A single-bit error is corrected and that's it. Nobody knows about it,
> let alone JFFS2 acts upon it.

That would eventually fill the whole flash with bad blocks, and for no 
really good reason (see below). The correct solution is to 'scrub' the 
flash, i.e. rewrite the data, potentially at another location (to avoid 
losing the data if a power outage occurs during the scrubbing procedure). 
I believe UBI does this.
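
I don't have the UBI code at hand, but just to illustrate the detection 
half of scrubbing: the driver corrects single-bit errors transparently 
on read, and the only trace left is the per-device ECC statistics, which 
can be polled from user space. A minimal sketch (the device node is made 
up and error handling is mostly omitted):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <mtd/mtd-user.h>

int main(void)
{
	struct mtd_info_user info;
	struct mtd_ecc_stats before, after;
	unsigned char *buf;
	int fd = open("/dev/mtd3", O_RDONLY);	/* hypothetical partition */

	if (fd < 0 || ioctl(fd, MEMGETINFO, &info) < 0)
		return 1;

	buf = malloc(info.erasesize);
	ioctl(fd, ECCGETSTATS, &before);

	/* Read the first eraseblock; a single-bit error is corrected
	   transparently, but bumps the 'corrected' counter. */
	if (read(fd, buf, info.erasesize) != (ssize_t)info.erasesize)
		perror("read");

	ioctl(fd, ECCGETSTATS, &after);
	if (after.corrected > before.corrected)
		printf("block 0 had %u corrected bitflip(s), scrub it\n",
		       after.corrected - before.corrected);

	free(buf);
	close(fd);
	return 0;
}

A block flagged this way would then be rewritten (copied to a clean 
block first, as above), which is essentially what UBI's scrubbing does.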

> Would #defining CONFIG_MTD_NAND_VERIFY_WRITE have helped/prevented this ?
> Currently CONFIG_MTD_NAND_VERIFY_WRITE isn't #defined. It would probably
> better to actually #define it. It looks like a failed verification doesn't
> lead to a block marked bad, why not ?

The problem with NAND flash is not only that the bit cells themselves 
wear out as with NOR flash, leading to lower data retention times and 
ultimately causing write and erase operations to fail, but also that, 
due to the design and cell density of the memory array, the data in the 
bit cells tends to decay over time. Reading a bit cell accelerates the 
process (and there are other influences, such as reading bit cells in 
the same area of the chip, etc). There is nothing wrong or worn out 
about the flash; it is just a natural consequence of the technology 
used.

Consequently, reading back the data immediately after writing will not 
find any errors; the bit errors creep in over time as the data in the 
bit cells decays. The only real ways to deal with this are to rewrite 
the data from time to time, or to increase the number of bits the ECC 
algorithm can correct, as you mention.

That said, some manufacturers seem to be worse than others. In a random 
sample test we did with the memory density you mention, after about half 
a million reads from a 512 kbyte partition on a Numonyx (ST) or Hynix 
flash there were consistently multiple bit errors, whereas 4.5 million 
reads from a similarly configured Samsung flash yielded no bit errors at 
all. Admittedly, this was a forced test, with the test script repeatedly 
reading the same area over the course of a couple of days, but it still 
says something about the relative reliability of the parts, at least.
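
For reference, the actual test was a script, but it boiled down to 
something like the following sketch in C (device node made up, and the 
ECC counters are assumed to start at zero):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <mtd/mtd-user.h>

int main(void)
{
	struct mtd_info_user info;
	struct mtd_ecc_stats st;
	unsigned char *buf;
	unsigned long pass;
	off_t off;
	int fd = open("/dev/mtd5", O_RDONLY);	/* hypothetical test partition */

	if (fd < 0 || ioctl(fd, MEMGETINFO, &info) < 0)
		return 1;

	buf = malloc(info.erasesize);

	/* Reread the whole partition until the driver reports that its
	   ECC had to correct (or gave up on) some bits. */
	for (pass = 1; ; pass++) {
		lseek(fd, 0, SEEK_SET);
		for (off = 0; off < info.size; off += info.erasesize)
			if (read(fd, buf, info.erasesize) != (ssize_t)info.erasesize)
				perror("read");

		if (ioctl(fd, ECCGETSTATS, &st) == 0 &&
		    (st.corrected || st.failed)) {
			printf("pass %lu: %u corrected, %u uncorrectable\n",
			       pass, st.corrected, st.failed);
			break;
		}
	}

	free(buf);
	close(fd);
	return 0;
}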

/Ricard
--
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30


