state of support for "external ECC hardware"

Wed Nov 14 05:59:06 EST 2012

Hi Chris,

Sorry to come to this thread late (I have been working on non-Flash related
projects recently!), but I have used several Micron "on-die" ECC parts so I
thought I would share my experience.  I will try to collate here my comments on
some of the issues raised in the thread.

On 11/08/2012 04:21 PM, Christopher Harvey wrote:
> We had BCH8 code running, but it wasn't enough. The main reason we
> switched away from host side ECC was because we were getting bitflips
> within the ECC codeword data itself.

I think this point has already been dealt with by others, but just to confirm,
the ECC algorithms handle bit-flips in the data or ECC code.  (In fact, as part
of my test procedures, I manually insert bit-errors in the data and ECC areas
and check for correct fixups.)
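
To illustrate why a flip in the ECC bytes is no different from a flip in the
data, here is a minimal sketch using a toy Hamming(7,4) code.  This is NOT BCH
and not what any NAND driver uses; the function names are made up for this
example.  The point is only that the decoder works on the whole codeword, so a
single flipped bit is located and fixed whether it lands in a data position or
a parity position.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword.

    Codeword layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (corrected_codeword, syndrome).

    The syndrome is 0 for a clean codeword, otherwise it is the 1-based
    position of the single flipped bit (data or parity alike).
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the offending bit back
    return c, syndrome


def check_all_single_flips(data_bits):
    """Flip each of the 7 codeword bits in turn; True if all are corrected."""
    clean = hamming74_encode(data_bits)
    for pos in range(7):
        damaged = list(clean)
        damaged[pos] ^= 1
        fixed, syndrome = hamming74_correct(damaged)
        if fixed != clean or syndrome != pos + 1:
            return False
    return True
```

Positions 1, 2 and 4 of the codeword are parity bits, so check_all_single_flips()
exercises flips in the "ECC area" as well as in the data.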

On 11/08/2012 04:37 PM, Gerlando Falauto wrote:
> And BTW, wouldn't you also need to explicitly disable on-die ECC in
> order to force that, anyway?

Yes, to insert bit-errors your driver needs to support raw read and write
operations.  You can then use a combination of nandwrite and nanddump to inject
1 or more bitflips, something like:

	1. Write data to Flash, generating ECC data in the process
		nandwrite /dev/mtd? page.bin
	2. Read data + OOB in raw mode
		nanddump --noecc --oob --length=2048 --file=page_ecc.bin /dev/mtd?
	3. Check data for no *real* bit flips

	4. Inject bit-flips into 'page_ecc.bin', saving the result as 'page_ecc_err.bin'

	5. Write corrupted data to a new page, in raw mode
		nandwrite --noecc --oob /dev/mtd? page_ecc_err.bin

	6. Read back, using ECC
		nanddump --length=2048 --file=page_ecc_fix.bin /dev/mtd?

	7. Check bit-flips have been corrected

[I have a standalone program that implements the same procedure, testing every
bit and multiple bits, although it is not really fit for public consumption I am
afraid.]
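
For steps 3, 4 and 7, a small script is enough to do the flipping and the
comparing.  The following is only a rough sketch, not the standalone program
mentioned above; the helper names (flip_bits, diff_bits, corrupt_file) are
invented for this example, and the file names are the ones used in the steps.

```python
def flip_bits(data, bit_offsets):
    """Return a copy of 'data' with the given bit offsets inverted.

    Bit offset n refers to bit (n % 8) of byte (n // 8), LSB first.
    """
    buf = bytearray(data)
    for off in bit_offsets:
        buf[off // 8] ^= 1 << (off % 8)
    return bytes(buf)


def diff_bits(a, b):
    """Return the sorted list of bit offsets at which 'a' and 'b' differ."""
    if len(a) != len(b):
        raise ValueError("buffers differ in length")
    diffs = []
    for i, (x, y) in enumerate(zip(a, b)):
        delta = x ^ y
        for bit in range(8):
            if (delta >> bit) & 1:
                diffs.append(i * 8 + bit)
    return diffs


def corrupt_file(src, dst, bit_offsets):
    """Step 4: read a raw page+OOB dump, flip bits, write the damaged copy."""
    with open(src, "rb") as f:
        raw = f.read()
    with open(dst, "wb") as f:
        f.write(flip_bits(raw, bit_offsets))
```

For example, corrupt_file("page_ecc.bin", "page_ecc_err.bin", [5, 16400]) would
flip one bit in the first data byte and one bit at byte offset 2050, which sits
in the OOB for a 2048-byte page.  In step 3, diff_bits() on the main-area bytes
of 'page_ecc.bin' versus 'page.bin' should return an empty list; in step 7 the
same should hold for 'page_ecc_fix.bin'.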

On 11/08/2012 05:02 PM, Christopher Harvey wrote:
> I was surprised too. I was seeing about 30 bitflips per 512MB. Running
> at about 1/3 of max bus speed. No error codes on write.

That is probably a bit higher than we have experienced, but not significantly so.

On 11/08/2012 05:02 PM, Christopher Harvey wrote:
> I don't know the details of BCH, but apparently not. I asked Micron if
> the OOB area was safer to write to, and they said no. Can somebody on
> this list confirm this?

The OOB area is the same as any other part of the page, in terms of reliability,
and therefore subject to the same ECC requirements.  One thing to look out for
with the Micron devices is that the on-die ECC is applied to some but not all of
the OOB area.  For the ECC-protected OOB, it is important that any data here is
written at the same time as the page data -- this has consequences when using
filesystems that store meta-data in the OOB (e.g. YAFFS2 and, to some extent,
JFFS2).  At the time, there was no user-space tool, or IOCTL, that could write
Page+OOB in one go.  To support writing YAFFS2 images, they had to invent their
own IOCTL and a new tool!

On 11/12/2012 05:19 PM, Gerlando Falauto wrote:
> Would there be any reason *NOT* to use 4-bit ECC with parts which do not
> require it? Apart from performance, of course.

As long as your ECC strength matches or exceeds the reliability characteristics
of the NAND device, there should be no problem (except perhaps for performance).
Indeed, some have been known to use over-spec'ed ECC schemes in an attempt to
improve endurance and data retention -- the qualification reports from the
manufacturers tend to be a bit vague on how effective this strategy might be though.

On 11/08/2012 11:02 AM, Gerlando Falauto wrote:
> As for hardware-based (or on-die) ECC support, one of the application
> notes from Micron (TN-29-56 Enabling On-Die ECC for OMAP3 on
> Linux/Android OS

The TN provides a good start, but neglects a few areas, including:
	* the default BBT pattern clashes with on-die ECC locations
	* it makes no attempt to support raw read/write operations
	* it does not handle the REWRITE status flag

For what it's worth, I have attached the patch we added to support the Micron
on-die ECC devices -- based on a rather old 2.6.32 kernel I am afraid.  We have
since updated the probing code that detects on-die ECC capabilities, but it
might help if you are planning to do your own support.

Cheers,

Angus
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 0001-mtd_nand-Support-for-Micron-on-die-4-bit-ECC-SLC-LP-.patch
URL: <http://lists.infradead.org/pipermail/linux-mtd/attachments/20121114/f5fb384a/attachment-0001.ksh>