[RFC/PATCH 0/5 v2] mtd:ubi: Read disturb and Data retention handling

Wed Nov 12 03:55:23 PST 2014

On Tue, 2014-11-11 at 22:36 +0200, Tanya Brokhman wrote:
> Unfortunately none. This is done for a new device that we received just 
> now. The development was done on a virtual machine with nandsim. Testing 
> was more of stability and regression

OK. So the implementation is theory-driven and lacks experimental
proof. This means that building a product based on this implementation
carries a certain amount of risk.

And from where I stand, the theoretical basis for the solution also
does not look very strong.

> > The advantages of the "read all periodically" approach were:
> >
> > 1. Simple, no modifications needed
> > 2. No need to write if the media is read-only, except when scrubbing
> > happens.
> > 3. Should cover all the NAND effects, including the "radiation" one.
> 
> Disadvantages (as I see it):
> 1. performance hit: when do you trigger the "read-all"? will affect
> performance

Right. We do not know how often, just like we do not know how often and
how much (read counter threshold) in your proposal.

Performance - sure, that is a matter of experiment, just like the
performance of your solution. And, I should note, energy too (read:
battery life).

In your solution you have to do more work maintaining the counters and
writing them. With the read-all solution you do more work reading data.

The promise is that the reading may be done in the background, when
there is no other I/O.
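
To make the "read all in background" idea concrete, here is a minimal
sketch. It is not UBI code: struct sim_peb, read_all_pass() and the
"budget" parameter are hypothetical names, and the ECC result of a read
is simulated by a plain counter. The budget stands in for "how many
PEBs we may read while there is no other I/O".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-PEB state for this sketch; not UBI's real structures. */
struct sim_peb {
	int bitflips;       /* correctable bit-flips a read would report */
	bool scrub_queued;  /* set once the PEB is scheduled for scrubbing */
};

/*
 * Read up to 'budget' PEBs starting at '*cursor', wrapping around at
 * 'count'. Any PEB whose (simulated) read reports bit-flips is queued
 * for scrubbing. Returns how many PEBs were newly queued and leaves
 * '*cursor' at the next PEB to read, so the pass can be resumed later.
 */
size_t read_all_pass(struct sim_peb *pebs, size_t count,
		     size_t budget, size_t *cursor)
{
	size_t queued = 0;

	assert(pebs != NULL && cursor != NULL && count > 0);

	for (size_t i = 0; i < budget; i++) {
		struct sim_peb *peb = &pebs[*cursor];

		/* A real read would return the ECC status; here it is simulated. */
		if (peb->bitflips > 0 && !peb->scrub_queued) {
			peb->scrub_queued = true;
			queued++;
		}
		*cursor = (*cursor + 1) % count;
	}
	return queued;
}
```

The point of the cursor is that the pass can be cut short whenever real
I/O shows up and picked up again later. How large the budget is and how
often a pass runs are exactly the open questions mentioned above.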

> 2. finds bitflips only when they are present instead of preventing them 
> from happening

But is this true? I do not see how this is true in your case. You want
to scrub by threshold, which is a theoretical value with a very large
deviation from the real one. And there may not even be a single real
one - the real one depends on the erase block, it depends on the I/O
patterns, and it depends on the temperature.

You will end up scrubbing a lot earlier than needed. That is where the
performance loss comes in too (and energy). And sooner or later you
will also end up scrubbing too late.

I do not see how your solution provides any hard guarantee. Please
explain how you guarantee that my PEB does not bit-rot before the read
counter reaches the threshold. It may bit-rot earlier because it is
close to being worn out, or just because of a higher temperature, or
because it has a nano-defect.
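
The objection can be stated as a tiny sketch. The names are
hypothetical, and 'real_limit' is a made-up quantity: the number of
reads a particular PEB can actually take before its data goes bad,
which nobody knows in practice and which differs from PEB to PEB.

```c
#include <assert.h>

enum scrub_outcome { SCRUB_EARLY, SCRUB_ON_TIME, SCRUB_LATE };

/*
 * Compare a fixed read-counter threshold against the (unknowable,
 * per-PEB) number of reads after which the data is really damaged.
 * Hypothetical helper for illustration only.
 */
enum scrub_outcome threshold_outcome(unsigned long threshold,
				     unsigned long real_limit)
{
	assert(threshold > 0);

	if (threshold > real_limit)
		return SCRUB_LATE;   /* data rotted before the counter fired */
	if (threshold < real_limit)
		return SCRUB_EARLY;  /* wasted erase cycle, time and energy */
	return SCRUB_ON_TIME;
}
```

With one threshold for the whole device and a real limit that varies
with wear, temperature and defects, the same threshold lands in
SCRUB_EARLY for some PEBs and in SCRUB_LATE for others.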

> Perhaps our design is an overkill for this and not covering 100% of the
> usecases. But it was requested by our customers to handle read-disturb 
> and data retention specifically (as in "prevent" and not just "fix"). 
> This is due to a new NAND device that should operate in high temperature 
> and last for ~15-20 years.

I understand the whole customer orientation concept. But so far the
solution does not feel to me like something suitable for any customer I
could imagine. I mean, if I think of myself as a potential customer, I
would just want my data to be safe and protected from all the NAND
effects. I would not want counters, I'd want the result. And in the
proposed solution I do not see how I'd get the guaranteed result. But
of course I do not know the customer requirements that you've got.