[PATCH 0/3] MTD: Change meaning of -EUCLEAN return code on reads
Mike Dunn
mikedunn at newsguy.com
Fri Mar 16 12:25:08 EDT 2012
Hi Ivan. Thanks for the review!
On 03/16/2012 04:19 AM, Ivan Djelic wrote:
>
> Consider the following situation:
> - a NAND device with 2kB pages and 4 ecc steps per page (4 x 512 bytes)
> - the driver has chip->ecc.strength = 4, and therefore mtd->ecc_strength = 16
> - let's say mtd->bitflip_threshold = 16
>
> The driver read() method could return a non-negative integer, say 4, in at least
> the following cases:
>
> 1. During a single page read, each of the 4 ecc steps corrected 1 bit, with a
> total variation of ecc_stats.corrected equal to 4.
> => no cleaning needed
>
> 2. During a single page read, 1 ecc step corrected 4 bits, the 3 other steps had
> no correction to perform, with a total variation of ecc_stats.corrected equal
> to 4.
> => cleaning is needed
Maybe my (admittedly limited) understanding of the physical nature of NAND flash
is flawed. I assumed that a writesize region (i.e., a NAND page for our
purposes) is the most elemental unit wrt physical wear, regardless of whether
ecc is calculated once for the whole page or incrementally in steps.
>
> In both cases, you will compare the same value 4 to mtd->bitflip_threshold (16)
> and decide to return 0 (and not -EUCLEAN).
>
> So my point is that the cleaning decision happens at the ecc step level,
> not at the page reading level.
But you're saying my assumption is incorrect. So each ecc-sized area within a
page is physically distinct and must be considered in isolation? Could you
maybe elaborate on this?
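To make sure I follow the example, here is a minimal userspace sketch of the
two cases, assuming the page-level check simply sums the per-step corrections.
The names page_total_corrected() and exceeds_page_threshold() are hypothetical
and do not exist in the MTD code; only the numbers come from your example.

```c
#include <stdbool.h>
#include <stddef.h>

#define ECC_STEPS 4

/* Case 1: each of the 4 ecc steps corrected 1 bit. */
static const unsigned int case1_steps[ECC_STEPS] = { 1, 1, 1, 1 };
/* Case 2: one ecc step corrected 4 bits, the other three corrected none. */
static const unsigned int case2_steps[ECC_STEPS] = { 4, 0, 0, 0 };

/* Hypothetical helper: sum of corrected bitflips over all ecc steps of a
 * page, i.e. what a per-page delta of ecc_stats.corrected would report. */
static unsigned int page_total_corrected(const unsigned int *steps, size_t n)
{
    unsigned int total = 0;

    for (size_t i = 0; i < n; i++)
        total += steps[i];
    return total;
}

/* Hypothetical page-level check: compare the page total to the threshold. */
static bool exceeds_page_threshold(const unsigned int *steps, size_t n,
                                   unsigned int bitflip_threshold)
{
    return page_total_corrected(steps, n) >= bitflip_threshold;
}
```

Both arrays sum to 4, so with a threshold of 16 neither case is flagged, even
though case 2 has one step sitting at its correction limit.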
>
> I think this could be fixed by dropping 'ecc_strength' and changing the semantics
> of 'bitflip_threshold' in the following way (rephrasing your explanation):
>
> (3) The drivers' read methods, absent an error, return a non-negative integer
> indicating the maximum number of bit errors that were corrected in any one
> ecc step. MTD returns -EUCLEAN if this is >= bitflip_threshold, 0
> otherwise.
>
> So basically, the meaning of -EUCLEAN is changed from "one or more bit errors
> were corrected", to "a dangerously high number of bit errors were corrected on
> one or more ecc step block". By default, "dangerously high" is interpreted
> as chip->ecc.strength. Drivers can specify a different value, and the user can
> override it if more or less caution regarding data integrity is desired.
>
> But still, there is a problem: how do we implement (3), i.e. how do we know
> "the maximum number of bit errors that were corrected in any one ecc step" ?
>
> Just looking at ecc_stats.corrected is not enough, as it accumulates over each
> ecc step result, and does not allow us to distinguish cases 1 and 2 (from my
> previous example). Maybe we could have per-step ecc stats ? or have the driver
> return directly the information ?
Yes, this will require more work, touching many drivers :( The per-page stats
allowed me to limit most of the changes to nand_base.c.
If you are correct about the need to consider each ecc-sized region separately,
then these patches are actually a regression, since a "dangerously high" number
of bitflips will be considered OK, as your example illustrates.
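For comparison, a rough userspace sketch of the per-step check described in
(3), assuming the driver can report the corrections made in each ecc step. The
helpers max_corrected_per_step() and read_status() are hypothetical, and the
fallback EUCLEAN value is only there so the sketch builds outside Linux.

```c
#include <errno.h>
#include <stddef.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* assumption: Linux value, used only if errno.h lacks it */
#endif

#define ECC_STEPS 4

/* Same two cases as in the example: 1+1+1+1 versus 4+0+0+0. */
static const unsigned int case1_steps[ECC_STEPS] = { 1, 1, 1, 1 };
static const unsigned int case2_steps[ECC_STEPS] = { 4, 0, 0, 0 };

/* Hypothetical helper: largest number of bitflips corrected in any one
 * ecc step of the page. */
static unsigned int max_corrected_per_step(const unsigned int *steps, size_t n)
{
    unsigned int max = 0;

    for (size_t i = 0; i < n; i++)
        if (steps[i] > max)
            max = steps[i];
    return max;
}

/* Hypothetical read status under (3): -EUCLEAN if the worst ecc step
 * reached bitflip_threshold, 0 otherwise. */
static int read_status(const unsigned int *steps, size_t n,
                       unsigned int bitflip_threshold)
{
    if (max_corrected_per_step(steps, n) >= bitflip_threshold)
        return -EUCLEAN;
    return 0;
}
```

With bitflip_threshold defaulting to the per-step strength of 4, case 1 would
return 0 and case 2 would return -EUCLEAN, which is the distinction the
per-page total cannot make.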
Thanks again.
Mike