[PATCH 0/3] MTD: Change meaning of -EUCLEAN return code on reads

Mike Dunn mikedunn at newsguy.com
Fri Mar 16 12:25:08 EDT 2012


Hi Ivan.  Thanks for the review!

On 03/16/2012 04:19 AM, Ivan Djelic wrote:
> 
> Consider the following situation:
> - a NAND device with 2kB pages and 4 ecc steps per page (4 x 512 bytes)
> - the driver has chip->ecc.strength = 4, and therefore mtd->ecc_strength = 16
> - let's say mtd->bitflip_threshold = 16
> 
> The driver read() method could return a non-negative integer, say 4, in at least
> the following cases:
> 
> 1. During a single page read, each of the 4 ecc steps corrected 1 bit, with a
> total variation of ecc_stats.corrected equal to 4.
> => no cleaning needed
> 
> 2. During a single page read, 1 ecc step corrected 4 bits, the 3 other steps had
> no correction to perform, with a total variation of ecc_stats.corrected equal
> to 4.
> => cleaning is needed


Maybe my (admittedly limited) understanding of the physical nature of NAND flash
is flawed.  I assumed that a writesize region (i.e., a NAND page for our
purposes) is the most elemental unit wrt physical wear, regardless of whether
ecc is calculated once for the whole page or incrementally in steps.


> 
> In both cases, you will compare the same value 4 to mtd->bitflip_threshold (16)
> and decide to return 0 (and not -EUCLEAN).
> 
> So my point is that the cleaning decision happens at the ecc step level,
> not at the page reading level.


But you're saying my assumption is incorrect.  So each ecc-sized area within a
page is physically distinct and must be considered in isolation?  Could you
maybe elaborate on this?


> 
> I think this could be fixed by dropping 'ecc_strength' and changing the semantics
> of 'bitflip_threshold' in the following way (rephrasing your explanation):
> 
>   (3) The drivers' read methods, absent an error, return a non-negative integer
>       indicating the maximum number of bit errors that were corrected in any one
>       ecc step.  MTD returns -EUCLEAN if this is >= bitflip_threshold, 0
>       otherwise.
> 
>   So basically, the meaning of -EUCLEAN is changed from "one or more bit errors
>   were corrected", to "a dangerously high number of bit errors were corrected on
>   one or more ecc step block".  By default, "dangerously high" is interpreted
>   as chip->ecc.strength.  Drivers can specify a different value, and the user can
>   override it if more or less caution regarding data integrity is desired.
> 
> But still, there is a problem: how do we implement (3), i.e. how do we know
> "the maximum number of bit errors that were corrected in any one ecc step" ?
> 
> Just looking at ecc_stats.corrected is not enough, as it accumulates over each
> ecc step result, and does not allow us to distinguish cases 1 and 2 (from my
> previous example). Maybe we could have per-step ecc stats ? or have the driver
> return directly the information ?


Yes, this will require more work, touching many drivers :(  The per-page stats
allowed me to limit most of the changes to nand_base.c.

If you are correct about the need to consider each ecc-sized region separately,
then these patches are actually a regression, since a "dangerously high" number
of bitflips will be considered OK, as your example illustrates.
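
To make the two cases concrete, here is a minimal standalone sketch.  It is not
kernel code: the helper names page_total(), step_max() and check_threshold()
are hypothetical and exist only to put the two policies side by side.  It
assumes the 4-step layout from your example, with per-step bitflip counts
passed in as a plain array.

```c
#include <errno.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value; fallback for systems that lack it */
#endif

#define ECC_STEPS 4	/* 2kB page, 4 x 512-byte ecc steps (from the example) */

/* Per-page policy: sum the corrected bitflips over all ecc steps. */
static unsigned int page_total(const unsigned int *step_bitflips)
{
	unsigned int i, total = 0;

	for (i = 0; i < ECC_STEPS; i++)
		total += step_bitflips[i];
	return total;
}

/* Per-step policy: report only the worst single ecc step. */
static unsigned int step_max(const unsigned int *step_bitflips)
{
	unsigned int i, max = 0;

	for (i = 0; i < ECC_STEPS; i++)
		if (step_bitflips[i] > max)
			max = step_bitflips[i];
	return max;
}

/* Hypothetical threshold check: -EUCLEAN at or above threshold, else 0. */
static int check_threshold(unsigned int bitflips, unsigned int threshold)
{
	return bitflips >= threshold ? -EUCLEAN : 0;
}
```

With case 2's counts { 4, 0, 0, 0 }, check_threshold(page_total(), 16) returns
0, while check_threshold(step_max(), 4) returns -EUCLEAN.  With case 1's counts
{ 1, 1, 1, 1 }, both checks return 0.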

Thanks again.

Mike


