preventing multi-bit errors on NAND
ricard.wanderlof at axis.com
Wed Oct 8 06:54:15 EDT 2008
On Wed, 8 Oct 2008, Norbert van Bolhuis wrote:
> Some of our NAND chips (128MB, 128k blocks, 2k pages) have multi-bit errors.
> I studied some JFFS2/NAND/MTD source code and wonder whether we could
> have prevented this. I also looked at the latest JFFS2/NAND/MTD code
> (kernel 2.6.26) and there aren't any major ECC/bad-block changes/improvements.
I think UBI provides at least some of the functionality you are looking
for (see the MTD homepage).
Otherwise, NAND flash reliability seems to vary considerably between
manufacturers in this respect, regardless of how similar the parts look
on paper in their specifications.
> Why not use 6 bytes ECC code (per 256 bytes) to correct at max 2 bits ?
> I know, this is not standard and would cause incompatibilities, still
> I'd like to know whether it could be done or already has been done. There's
> enough room in the OOB I believe.
That would help anyway.
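For reference, the standard software ECC used on NAND is a Hamming-class
code: 3 bytes per 256-byte chunk, correcting one flipped bit and detecting
two. Correcting two bits needs a stronger code such as BCH or Reed-Solomon.
As a rough illustration of why one bit is the limit (this is a toy SEC-DED
scheme over a 256-byte buffer, not the actual nand_ecc.c bit layout, and it
assumes the stored ECC word itself is intact):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DATA_BITS 2048          /* 256 bytes */
#define SYN_BITS  11            /* 2^11 = 2048 bit positions */

static int get_bit(const uint8_t *buf, int i)
{
	return (buf[i >> 3] >> (i & 7)) & 1;
}

static void flip_bit(uint8_t *buf, int i)
{
	buf[i >> 3] ^= 1 << (i & 7);
}

/* 11 position-parity bits plus one overall parity bit, packed together:
 * the low bits are the XOR of the indices of all set data bits. */
static uint16_t compute_ecc(const uint8_t *buf)
{
	uint16_t ecc = 0;
	int overall = 0;

	for (int i = 0; i < DATA_BITS; i++) {
		if (get_bit(buf, i)) {
			ecc ^= i;
			overall ^= 1;
		}
	}
	return ecc | (overall << SYN_BITS);
}

/* Returns 0 = clean, 1 = single-bit error corrected in place,
 * -1 = two-bit (uncorrectable) error detected. */
static int correct(uint8_t *buf, uint16_t stored)
{
	uint16_t diff = stored ^ compute_ecc(buf);
	uint16_t syn = diff & (DATA_BITS - 1);
	int parity = (diff >> SYN_BITS) & 1;

	if (!syn && !parity)
		return 0;
	if (parity) {
		flip_bit(buf, syn);     /* syndrome = index of flipped bit */
		return 1;
	}
	return -1;      /* two flips: nonzero syndrome, parity unchanged */
}
```

A two-bit error leaves the overall parity unchanged, so the syndrome can
only flag it, not locate either bit; that is exactly why going from 1-bit
correction to 2-bit correction requires more ECC bytes and a different code.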
> Why not mark a block bad when detecting a single-bit error ?
> I assume a multi-bit error was a single-bit error before.
> A single-bit error is corrected and that's it. Nobody knows about it,
> let alone JFFS2 acts upon it.
That would eventually fill the whole flash with bad blocks, and for no
really good reason (see below). The correct solution is to 'scrub' the
flash, i.e. rewrite the data, potentially at another location (to avoid
losing the data if a power outage occurs during the scrubbing procedure).
I believe UBI does this.
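The point about rewriting "potentially at another location" is the ordering:
program the fresh copy first, and only then erase the old block, so a power
cut never leaves the sole copy half-written. A minimal sketch of that
ordering, using a hypothetical in-RAM flash model rather than the real UBI
code:

```c
#include <stdint.h>
#include <string.h>

#define BLOCKS 4
#define BLKSZ  64

/* Toy flash model: an array of blocks plus an erased flag per block. */
struct flash_sim {
	uint8_t blk[BLOCKS][BLKSZ];
	int erased[BLOCKS];
};

/* Scrub block 'src': program the ECC-corrected data into a spare erased
 * block first, then erase the source. Returns the data's new block
 * number, or -1 if no spare block is available. */
static int scrub(struct flash_sim *f, int src, const uint8_t *corrected)
{
	int dst;

	for (dst = 0; dst < BLOCKS; dst++)
		if (f->erased[dst] && dst != src)
			break;
	if (dst == BLOCKS)
		return -1;

	memcpy(f->blk[dst], corrected, BLKSZ);  /* program new copy */
	f->erased[dst] = 0;
	memset(f->blk[src], 0xff, BLKSZ);       /* only now erase source */
	f->erased[src] = 1;
	return dst;
}
```

The old block goes back into the erased pool rather than being marked bad,
which is the whole point: nothing is actually wrong with it.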
> Would #defining CONFIG_MTD_NAND_VERIFY_WRITE have helped/prevented this ?
> Currently CONFIG_MTD_NAND_VERIFY_WRITE isn't #defined. It would probably
> better to actually #define it. It looks like a failed verification doesn't
> lead to a block marked bad, why not ?
The problem with NAND flash is not only that the bit cells themselves
wear out, as with NOR flash, leading to lower data retention times and
ultimately causing write and erase operations to fail, but also that,
due to the design and cell density of the memory array, the data in the
bit cells tends to decay over time. Reading a bit cell accelerates the
process (and there are other influences, such as reads of bit cells in
the same area of the chip, etc.). There is nothing wrong or worn out
about the flash; it is just a natural consequence of the technology used.
Consequently, reading back the data immediately after writing will not
find any errors; the bit errors creep in over time as the data in the
bit cells decays. The only real way to deal with this is to rewrite the
data from time to time, or to increase the number of bits the ECC
algorithm can correct, as you mention.
That said, some manufacturers seem to be worse than others. In a random
sample test we did with the memory density you mention, after about half a
million reads from a 512 kbyte partition on a Numonyx (ST) or Hynix flash
there were consistently multiple bit errors, whereas 4.5 million reads
from a similarly configured Samsung flash yielded no bit errors at all.
Admittedly, this was a forced test, with the test script repeatedly
reading the same area over the course of a couple of days, but it still
says something about the relative reliability of the parts.
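A forced read-disturb test of that kind can be sketched as below. The
device path and pass count are illustrative; on a real /dev/mtdN you
would additionally watch the kernel log (or the MTD ECC statistics) for
corrected-bit-flip counts, since a successful read() tells you nothing
about how hard the ECC had to work:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Repeatedly read a partition end-to-end to provoke read disturb.
 * Returns the number of passes in which a read failed (on MTD, an
 * uncorrectable ECC error surfaces as a read error), or -1 if the
 * device cannot be opened. */
static long stress_read(const char *path, int passes)
{
	uint8_t buf[4096];
	long failures = 0;

	for (int p = 0; p < passes; p++) {
		int fd = open(path, O_RDONLY);
		ssize_t n;

		if (fd < 0)
			return -1;
		while ((n = read(fd, buf, sizeof(buf))) > 0)
			;               /* discard data, we only want reads */
		if (n < 0)
			failures++;
		close(fd);
	}
	return failures;
}
```

Run against e.g. a hypothetical /dev/mtd1 for a few million passes, this
reproduces the kind of accelerated aging described above.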
Ricard Wolf Wanderlöf ricardw(at)axis.com
Axis Communications AB, Lund, Sweden www.axis.com
Phone +46 46 272 2016 Fax +46 46 13 61 30