[PATCH 0/3] MTD: Change meaning of -EUCLEAN return code on reads

Sat Mar 17 16:50:33 EDT 2012

Many thanks Shmulik.  This helped my thought process, which was distracted by
consideration of the "physical" nature of the device ;)  Yes, the integrity of
the entire page can be threatened in any one ecc step, so the danger threshold
should be based on the ecc strength of each step.
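
A minimal sketch of that idea, in Python for brevity rather than kernel C.
The function and parameter names are hypothetical, and treating "bitflips
equal to the strength" as the danger threshold is only an assumption taken
from the example quoted below, not a settled policy:

```python
import errno

# Fall back to the usual Linux values if the platform lacks these names.
EUCLEAN = getattr(errno, "EUCLEAN", 117)
EBADMSG = getattr(errno, "EBADMSG", 74)


def read_status(bitflips_per_step, ecc_strength, threshold=None):
    """Classify one page read from its per-ecc-step bitflip counts.

    bitflips_per_step: bitflips seen in each ecc step of the page.
    ecc_strength: maximum bitflips correctable within a single step.
    threshold: per-step count regarded as dangerous (assumed default:
               equal to ecc_strength).
    """
    if threshold is None:
        threshold = ecc_strength

    # The worst step decides the outcome, not the sum over the page.
    worst = max(bitflips_per_step, default=0)

    if worst > ecc_strength:
        return -EBADMSG   # one step is uncorrectable, so the page is corrupt
    if worst >= threshold:
        return -EUCLEAN   # corrected, but one step is close to failing
    return 0              # corrected (or clean), nothing to report
```

With Ivan's example, read_status([4, 0, 0, 0], 4) reports -EUCLEAN, whereas
read_status([1, 1, 1, 1], 4), the same four flips spread across the page,
reports 0.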

Thanks again,
Mike

On 03/16/2012 02:54 PM, Shmulik Ladkani wrote:
> Hi Mike,
> 
> On Fri, 16 Mar 2012 09:25:08 -0700 Mike Dunn <mikedunn at newsguy.com> wrote:
>> Maybe my (admittedly limited) understanding of the physical nature of NAND flash
>> is flawed.  I assumed that a writesize region (i.e., a NAND page for our
>> purposes) is the most elemental unit wrt physical wear, regardless of whether or
>> not ecc is calculated once for the whole page or incrementally in steps.
> 
> Bit-flips may occur on a per-cell basis, even in the OOB cells, as a
> result of program-disturb, charge-loss, or cell wear-out causing read
> sensing errors.
> 
>> But you're saying my assumption is incorrect.  So each ecc-sized area within a
>> page is physically distinct and must be considered in isolation?
> 
> There's no "physical" distinction, in the sense of cells being separated
> in the device or the like.
> Simply, the ECC algorithm is independently calculated over several
> portions of the page.
> But that's not a must: suppose X bits per Y bytes ECC is required; you
> may use a 2X / 2Y ECC and achieve similar integrity and endurance
> statistical characteristics.
> 
> For your purposes, whether the cleaning decision should be made at the
> ecc step level depends on how you define
> "a dangerously high number of bit errors".
> 
> Let's continue with Ivan's example (2KiB page, 4 eccsteps, 512 bytes
> each step, strength 4bits/512bytes).
> Suppose the first ECC portion has 4 bit errors, the other 3 portions
> have none.
> Now suppose that, several read operations later, a new bitflip is
> introduced within the first portion, leading to 5 bit errors.
> Obviously, the ECC algorithm is now unable to correct this portion,
> meaning the buffer is corrupt - which also means the entire page data
> read is corrupt. The nand infrastructure would return -EBADMSG - and you
> had just 5 bit errors over the entire page.
> 
> So the question is, would you consider 4 bit errors in the first ECC
> portion to be "a dangerously high number of bit errors" to report to
> the MTD users?
> If so, then yes, the cleaning decision should be made at the ecc
> step level, not at the page reading level.
> 
> Regards,
> Shmulik
> 
>