[PATCH v0] mtd: gpmi: Use cached syndromes to speedup erased region bitflip detection.

Bill Pringlemeir bpringlemeir at nbsps.com
Wed Jan 8 12:10:22 EST 2014


eliedebrauwer at gmail.com wrote:

> (Some background info, I expect these chips to be less than half used,
> hence plenty of erased blocks, also the NAND timings are already
> optimized for these chips).

That may not be true if the higher-layer file system does wear
levelling, especially dynamic wear levelling, where seldom-read inodes
are moved in order to spread the wear.

> On Wed, Jan 8, 2014 at 6:38 AM, Huang Shijie <b32955 at freescale.com> wrote:
>> thanks for the new patch.
>>
>> I suddenly thought of a new solution to this issue:
>> [1] when a bitflip occurs, the BCH will tell us the page is uncorrectable,
>> [2] if we catch an uncorrectable error, we could check the whole buffer and
>> count the number of bitflips.  Assume the number of bitflips is N.
>>
>> [3] if N < gf_len, we could assume this is an erased page, memset the
>> whole buffer, and tell the upper layer that this is a good empty page.
>>
>> [4] since [1] is very rare, I think this method is much faster than the
>> current solution.

Elie De Brauwer wrote:

> What you suggest will obviously be able to reach the maximum speed.
> I've been playing with a similar idea too, but I always bumped into the
> issue that you cannot know whether or not a page is actually an erased
> page.  In your solution, for example, you cannot distinguish between:
> - an erased page which suffers from bitflips
> - a genuinely uncorrectable page whose original contents may be close to
>   all 0xff's, but only has a handful of bits set to 0.

> The risk of this approach is that a page which the system should mark
> as uncorrectable (and deal with accordingly) is instead returned as a
> valid all-0xff's page, which the system will consider valid data.

> I agree this may sound a bit theoretical, and the risk (uncorrectable
> on a page with very few 0 bits) is small, but I'm not able to judge
> whether or not it can/should be neglected.

> What do you think?
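
If I read the proposal above correctly, it amounts to roughly the
following.  This is only a sketch; the helper name is made up here, and
the threshold is the gf_len from step [3]:

/* Sketch only: on a BCH "uncorrectable" status, count the 0 bits in the
 * decoded data; if there are fewer than the correctable limit, treat the
 * page as erased and hand clean 0xff data upward.
 */
#include <linux/bitops.h>
#include <linux/string.h>
#include <linux/types.h>

static bool treat_as_erased(u8 *buf, unsigned int len, unsigned int gf_len)
{
        unsigned int i, flips = 0;

        for (i = 0; i < len; i++)
                flips += hweight8((u8)~buf[i]);  /* count bits that read 0 */

        if (flips >= gf_len)
                return false;    /* too many 0 bits: really uncorrectable */

        memset(buf, 0xff, len);  /* report a good, empty page */
        return true;
}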

I think you have a good rationale.  However, there will always be errors
which can bypass the bit-flip checking of any ECC (just as any hash has
some theoretical collisions).  The higher-layer file system should still
have some integrity checks, and the '-EUCLEAN' return should have been a
clue that these pages/sectors were already wearing.  I.e., the NAND
device should fail gradually.  I don't think it is fair to expect the
MTD drivers to perform on a flash that is so excessively used that it
has bit flips way beyond the maximum specified by the vendors.
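
For what it is worth, that clue is already visible above MTD.  A
minimal, hypothetical consumer might do no more than this:

/* Hypothetical caller above MTD: -EUCLEAN from mtd_read() means the data
 * was corrected, but the bitflip count reached mtd->bitflip_threshold,
 * i.e. the "this area is wearing" hint mentioned above (UBI, for
 * instance, scrubs blocks on it).
 */
#include <linux/mtd/mtd.h>

static int read_and_note_wear(struct mtd_info *mtd, loff_t from,
                              size_t len, u_char *buf)
{
        size_t retlen;
        int ret = mtd_read(mtd, from, len, &retlen, buf);

        if (ret == -EUCLEAN)
                return 0;       /* data is usable, but consider scrubbing */

        return ret;
}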

Probably there is no completely right answer, just as you analyzed.  How
do you tell a mostly-0xff page with bit flips from an erased page that
the hardware ECC reports as bad?  It may be possible to 're-read' the
page with the hardware ECC turned off to validate that the whole page is
0xff (I have guessed you are comparing data with the ECC applied?), but
this would again affect performance.
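
Roughly what I have in mind, assuming the caller has already done that
raw re-read into separate data and OOB buffers; the names here are
invented:

/* Sketch: with the hardware ECC turned off, an erased page should read
 * back as (nearly) all 0xff in both the data and the OOB area.  Count
 * the stray 0 bits and accept the page only if they stay under a limit
 * such as the ECC strength.
 */
#include <linux/bitops.h>
#include <linux/types.h>

static bool raw_page_is_blank(const u8 *data, unsigned int data_len,
                              const u8 *oob, unsigned int oob_len,
                              unsigned int max_flips)
{
        unsigned int i, flips = 0;

        for (i = 0; i < data_len; i++)
                flips += hweight8((u8)~data[i]);
        for (i = 0; i < oob_len; i++)
                flips += hweight8((u8)~oob[i]);

        return flips <= max_flips;
}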

I think that we should have some reliability at the upper layers.  How
do you guard against a power failure, etc.?  The ECC should be able to
correct errors; this foists error detection onto another layer, where it
is highly likely to be caught.  Running file systems on mtdblock, etc.
might like this, but those have other issues, don't they?

My 2cents,
Bill Pringlemeir.


