[RFC/PATCH 0/5 v2] mtd:ubi: Read disturb and Data retention handling

Tue Nov 11 13:39:19 PST 2014

Tanya,

On 11.11.2014 at 21:36, Tanya Brokhman wrote:
> Hi Artem,
> 
> Hope I didn't drop any ccs this time... Sorry about that. Not on purpose.
> 
> On 11/7/2014 10:58 AM, Artem Bityutskiy wrote:
>> On Thu, 2014-11-06 at 14:16 +0200, Tanya Brokhman wrote:
>>> What I'm trying to say - it
>>> may be too late and you may lose data here. "preferred to prevent rather
>>> than cure".
>>
>> First of all, just to clarify, I do not have a goal of turning down your
>> patches. I just want to understand why this is the best design, and if
>> it is helpful to all Linux MTD users.
>>
>> Modern flashes have strong ECC codes protecting against many bit-flips.
>> MTD was even modified to stop reporting a single or a few bit-flips,
>> because those happen too often and they are "harmless", and do not
>> require scrubbing. We have the threshold value in MTD for this, which is
>> configurable, of course.
>>
>> Bit-flips develop slowly over time. If you get one more bit-flip, it is
>> not too late yet. You can mitigate the "too late" part by reading more
>> often of course.
>>
>> You also may lower the bit-flip threshold when reading for scrubbing.
>>
>> Could you try to "sell" your design in a way that it becomes clear why
>> it is better than just reading the entire flash periodically.
> 
> Please see my "selling" below :)
> 
>  Some hard
>> experimental data would be preferable.
> 
> Unfortunately none. This is done for a new device that we received just now. The development was done on a virtual machine with nandsim. Testing was mostly for stability and regressions.
> 
>>
>> The advantages of the "read all periodically" approach were:
>>
>> 1. Simple, no modifications needed
>> 2. No need to write if the media is read-only, except when scrubbing
>> happens.
>> 3. Should cover all the NAND effects, including the "radiation" one.
> 
> Disadvantages (as I see it):
> 1. performance hit: when do you trigger the "read-all"? It will affect performance

Only a stupid implementation will re-read/scrub all PEBs at once.
We can use a low priority thread. We can do this even in userspace.
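
As a rough sketch of the userspace variant (not taken from the patch set): a helper that reads a device node front to back in small chunks and sleeps between chunks, so the read pass is spread out instead of hitting all PEBs at once. The function name `paced_read`, the chunk size, the pause, and the example path `/dev/ubi0_0` are assumptions for illustration only. The sketch also assumes that the reads actually reach the flash, so that the lower layers get a chance to notice bit-flips.

```python
import time


def paced_read(path, chunk_size=128 * 1024, pause_s=0.05):
    """Read `path` front to back in small chunks, sleeping between chunks.

    Returns the number of bytes read. The data itself is discarded; the
    only purpose is to make the lower layers perform the reads.
    All names and defaults here are illustrative assumptions.
    """
    total = 0
    # buffering=0 gives an unbuffered binary file object, so each read()
    # call maps to a read request of at most chunk_size bytes.
    with open(path, "rb", buffering=0) as dev:
        while True:
            chunk = dev.read(chunk_size)
            if not chunk:  # empty bytes object means end of device/file
                break
            total += len(chunk)
            time.sleep(pause_s)  # yield so normal I/O is not starved
    return total


# Hypothetical usage (device path is an assumption):
#   os.nice(19)                 # lower our own CPU priority first
#   paced_read("/dev/ubi0_0")
```

Pacing by sleeping is the simplest throttle; running the process at low CPU or I/O priority would be an alternative or an addition.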

> 2. finds bitflips only when they are present instead of preventing them from happening

We can scrub unconditionally.
Even if we scrub every PEB once a week the erase counters won't go up very much.
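
A back-of-the-envelope check of that claim, as a sketch: it assumes one unconditional scrub pass costs roughly one extra erase per PEB, and it uses the ~15-20 year lifetime mentioned in the quoted mail. The endurance figure in the example is a placeholder, not a value for any real NAND part.

```python
def scrub_erases(years, interval_days):
    """Extra erases per PEB caused by scrubbing every `interval_days`.

    Assumes (for illustration) one erase per PEB per scrub pass.
    """
    return int(years * 365.25 // interval_days)


def endurance_fraction(years, interval_days, rated_erase_cycles):
    """Share of the rated erase budget consumed by scrubbing alone."""
    return scrub_erases(years, interval_days) / rated_erase_cycles


# Weekly scrubbing over the lifetimes mentioned in the thread:
print(scrub_erases(15, 7))   # 782
print(scrub_erases(20, 7))   # 1043

# Placeholder endurance of 100000 cycles (assumption, not a datasheet value):
print(endurance_fraction(20, 7, 100_000))  # 0.01043
```

Whether roughly a thousand extra erases per PEB counts as "not very much" depends on the rated endurance of the actual device.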

> Perhaps our design is overkill for this and does not cover 100% of the use cases. But it was requested by our customers to handle read-disturb and data retention specifically (as in
> "prevent" and not just "fix"). This is due to a new NAND device that should operate at high temperature and last for ~15-20 years.
> 
> But we did rethink this and we're dropping the "last erase timestamp" that was used to handle "data retention". We will force-scrub all PEBs once in a while (triggered by user) as
> Richard suggested.
> We're keeping the read counters though. I know that not all "read-disturb" scenarios are covered by this but it's more coverage than we have at the moment. So not a 100% perfect
> solution but better than none.
> 
> I will update the implementation and change the fastmap layout (as suggested by Richard earlier) or try using internal UBI volume. Still have some study to do on that...

Please don't (ab)use fastmap. If you really need persistent read-counters use an internal UBI volume.
But I think that time-based unconditional scrubbing will also do it. As long as we don't have sane threshold values,
keeping counters is useless.

Thanks,
//richard