[PATCH v2] mtd: nand: Add driver for M-sys / Sandisk diskonchip G4

Thu Nov 3 12:54:56 EDT 2011

On 11/03/2011 01:58 AM, Ivan Djelic wrote:
> Hi Mike,
> A few comments below:
>
>> +
>> +/* value generated by the HW ecc generator upon reading blank page */
>> +static uint8_t blank_read_hwecc[7] = {
>> +       0xcf, 0x72, 0xfc, 0x1b, 0xa9, 0xc7, 0xb9 };
> Using this blank ecc value to detect blank pages after reading them is not that
> reliable, because if a bitflip appears in an erased page, the HW ecc generator
> will generate a completely different sequence of bytes. Your driver will think
> the page is not blank, it will try to correct errors and fail. And UBIFS will
> not appreciate that.
>
> I can see two cleaner alternatives to solve this issue:
>
> 1. When you program a page, before writing hwecc to oob, adjust it like this:
>
>    hwecc[i] ^= blank_read_hwecc[i]^0xff;
>
> The effect of this is that you now get 0xffs as ecc for blank pages, and bitflip
> correction on erased pages for free. This is transparent to your controller,
> because this "adjustment" cancels itself upon reading when calc_ecc^recv_ecc is
> computed.

This is neat!  Will try this today.

> 2. Use unprotected spare oob byte 15 as a programmming marker: remove it from
> the oob_free list, and force it to 0 when you program a page. Now, you can
> easily detect if a page is blank or has been programmed by checking this byte.
> You can for instance count the number of bits set to 1 in the byte, and decide
> it is blank if that number is greater than 4; this ensures you are robust to
> bitflips in the marker byte itself.
>
> My preference would go to option 2 in your case.

Hmm, another good idea.  I haven't given much thought to robustness until now. 
Heck, I never expected to get this driver working right at all.  Once again, I'm
in your debt, Ivan. 

>> +
>> +       /* fix the errors */
>> +       fixederrs = 0;
>> +       for (i = 0; i < numerrs; i++) {
>> +               if (errpos[i] > DOCG4_USERDATA_LEN * 8)
>> +                       continue; /* error in ecc oob bytes; ignore */
>> +               change_bit(errpos[i], (unsigned long *)buf);
>> +               fixederrs++;
>> +       }
> Here, your check on errorpos[i] to skip ecc oob locations is correct; but a
> bitflip (in ecc bytes or elsewhere) is an indication that the nand page needs
> scrubbing (you don't want bitflips to accumulate), so you should increase
> fixederrs in that case also.

Thanks.  Will make this change.

> If you see a _lot_ of scrubbing operations from UBI, you may also consider
> concealing bitflip corrections; i.e. report them only if the number of corrected
> bits is above a certain threshold (e.g. >= 3 in your case).

Something to keep in mind when I start stress testing with ubifs.  I have
noticed some scrubbing in the testing so far.  At one point, it marked a block
as bad, so this may be an issue.  The error rate seems erratic.  Sometimes there
are many, and even the occasional uncorrectable error.  Lately, there have just
a few one-bit errors when I run nandtest over the whole device.  I noticed this
inconsistency while observing operation under the PalmOS code as well, so it's
not a problem in my driver.  Maybe due to solar activity :-)

Thanks again,
Mike