[RFC] UBI torture test fails to detect some bad blocks.

Fri Apr 8 01:24:11 PDT 2016

Hi!

On 08.04.2016 at 09:23, Arnaud Mouiche wrote:
> Hi all.
> 
> Just some details about what I experienced recently with some bad blocks on
> a MX35LF1GE4AB SPI NAND device (SLC, 1Gb, 4-bit ECC per 512-byte sub-page),
> where a UBI partition is attached to manage rootfs & co (as usual).
> 
> I got my hands on some devices that refuse to boot.
> Analysis of the erase counters shows that some PEBs were erased
> more than 100K times, while the majority have an EC below 20!

Ouch.

> Looking at the bad ones, they run the following scenario nearly in a loop:
> - Linux reads some file inside the rootfs
> - a bitflip is detected
> - scrubbing is scheduled
> - the scrubbing targets a PEB with a pretty high EC
> - this high EC is itself due to frequent bitflips in the target PEB in the past
> - while the PEB data is moved, a bitflip is detected, which schedules a torture test
> - the torture test *ALWAYS* passes (whereas bitflips are *VERY* frequent for
>   the same PEB when the read comes from a filesystem read).
> 
> So, it seems obvious the PEBs in question are bad PEBs.
> The question is now why the torture test passes.
> 
> Reproducing the pattern test by hand on this block shows the same result.
> But applying different patterns to different pages within the block shows that
> the content of some pages is affected by the content of the other pages.
> In particular, for this block, if the first page is full of FF and the rest
> of the block is full of 00, I can count more than 100 bitflips (!)

100 flips per ECC step? Shouldn't this lead to an uncorrectable ECC error?
I have no idea how many bits your ECC can fix...
Which bitflip threshold do you have? UBI sees bitflips only after the threshold
is reached. If it is too low, UBI scrubs too often, which seems to be the case here.
It is perfectly fine to have bitflips.

So, we need to dig a bit deeper first.
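
To make the threshold point concrete, here is a rough user-space sketch, not
kernel code. It assumes the geometry you quoted (512-byte ECC steps, 4
correctable bits per step) and ignores flips in the ECC bytes themselves.
count_bitflips(), classify_read() and the READ_* names are made up for
illustration. The idea mirrors what MTD does: a read is only reported as
"has bitflips" when the worst ECC step reaches the threshold, and it fails
outright when a step exceeds the ECC strength.

```c
#include <stddef.h>
#include <stdint.h>

#define ECC_STEP_SIZE 512 /* bytes per ECC step (assumed from the chip description) */
#define ECC_STRENGTH  4   /* correctable bits per step (assumed) */

/* Hypothetical result codes, loosely modelled on 0 / -EUCLEAN / -EBADMSG. */
enum read_status { READ_OK, READ_BITFLIPS, READ_UNCORRECTABLE };

/* Count the bits that differ between the written and the raw read-back data. */
static unsigned int count_bitflips(const uint8_t *expected, const uint8_t *raw,
                                   size_t len)
{
    unsigned int flips = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t diff = expected[i] ^ raw[i];

        while (diff) {          /* clear the lowest set bit each round */
            diff &= diff - 1;
            flips++;
        }
    }
    return flips;
}

/* Classify a read by the worst ECC step, not by the total number of flips. */
static enum read_status classify_read(const uint8_t *expected,
                                      const uint8_t *raw, size_t len,
                                      unsigned int threshold)
{
    unsigned int max_flips = 0;

    for (size_t off = 0; off < len; off += ECC_STEP_SIZE) {
        size_t n = len - off < ECC_STEP_SIZE ? len - off : ECC_STEP_SIZE;
        unsigned int flips = count_bitflips(expected + off, raw + off, n);

        if (flips > ECC_STRENGTH)
            return READ_UNCORRECTABLE;
        if (flips > max_flips)
            max_flips = flips;
    }
    return max_flips >= threshold ? READ_BITFLIPS : READ_OK;
}
```

So 100 flips spread over a whole block can stay below the per-step limit,
while a threshold of 1 would flag every read that needed any correction at all.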

> What kind of pattern should be added to detect this kind of issue?

This is a very hard question and almost impossible to answer as it is vendor
specific.

> We can think of testing every page one by one, but given the relatively large
> number of pages in a block, it doesn't sound realistic.
> The easiest way could be to use a random pattern, and try it a relatively low
> number of times.
> Indeed, this simple random test is efficient at detecting every bad block of this device.
> If the random test passes once (because this is a random test), chances are
> that the next bitflip detection will trigger a new torture test, and in the end
> the block will finally be detected as bad.

Having an additional random pattern is not a bad idea.
This is definitely something we can consider adding to UBI.
But I'm not happy with your implementation.

peb_rnd_buff = kmalloc(ubi->peb_size, GFP_KERNEL);
... is a big no-no. peb_size can be a few megabytes.

What about repeating a few random bytes over and over?
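
Something along these lines, as a user-space sketch only. SEED_LEN,
make_seed(), fill_repeating() and check_repeating() are hypothetical names,
and rand() stands in for whatever the kernel would use (get_random_bytes(),
for example). The point is that only a small seed has to be kept around; any
chunk of the PEB can be filled and verified from it, whatever buffer the
caller already has. I picked a seed length that does not divide typical page
sizes so that consecutive pages do not end up with identical content, but
that is a guess, not something I tested against your chip.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SEED_LEN 61 /* assumed: small, and not a divisor of the page size */

/* Draw the few random bytes that get repeated over the whole PEB. */
static void make_seed(uint8_t *seed)
{
    for (size_t i = 0; i < SEED_LEN; i++)
        seed[i] = (uint8_t)rand();
}

/*
 * Fill one chunk of the PEB. peb_offset is the chunk's byte offset inside
 * the PEB, so the pattern continues seamlessly from one chunk to the next.
 */
static void fill_repeating(uint8_t *buf, size_t len, const uint8_t *seed,
                           size_t peb_offset)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = seed[(peb_offset + i) % SEED_LEN];
}

/* Return 1 if the chunk read back still matches the pattern, 0 otherwise. */
static int check_repeating(const uint8_t *buf, size_t len,
                           const uint8_t *seed, size_t peb_offset)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] != seed[(peb_offset + i) % SEED_LEN])
            return 0;
    return 1;
}
```

Whether a short repeated seed still triggers the page-to-page interference
you describe is something that would have to be checked on the real hardware.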

> And the implementation is pretty obvious...

;-)

Thanks,
//richard