Does modern UBI/UBIFS still suffer from the 'unstable bits issue'?

Thu Mar 1 17:19:54 PST 2018

On Thu, Mar 1, 2018 at 8:32 AM, Richard Weinberger <richard at nod.at> wrote:
> Tim,
>
> On Thursday, 1 March 2018, 17:15:44 CET, Tim Harvey wrote:
>> Greetings,
>>
>> I have a user with an IMX6 and raw NAND using UBI/UBIFS who has been
>> able to reproduce a NAND corruption:
>
> What does your user do to reproduce this?

Richard,

It's unclear at the moment. It's one of those 'this happened twice on
two different boards' reports without a lot of detail. However, I do
know they write to the filesystem on every boot and encounter random
power-cuts.

>
>> [   10.611972] UBIFS (ubi0:2): background thread "ubifs_bgt0_2" started, PID 631
>> [   10.634365] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 253952 bytes from PEB 2807:8192, read only 253952 bytes, retry
>> [   10.657492] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 253952 bytes from PEB 2807:8192, read only 253952 bytes, retry
>> [   10.681137] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 253952 bytes from PEB 2807:8192, read only 253952 bytes, retry
>> [   10.704267] ubi0 error: ubi_io_read: error -74 (ECC error) while reading 253952 bytes from PEB 2807:8192, read 253952 bytes
>>
>> The kernel they are using is a bit out of date but does have
>> 'gpmi-nand: Handle ECC Errors in erased pages' [1] patch
>>
>> I'm wondering if the 'unstable bits issue' [2] is still an issue or if
>> the UBI/UBIFS documentation is out of date and this has been resolved.
>> If it has been resolved, can anyone point me to the patches?
>
> This issue is highly theoretical and I never actually saw it in the wild.
> Every single time someone claimed to suffer from that, it turned out to be
> something else. Currently UBI/UBIFS has no countermeasure, for the reasons
> given above.
> This reminds me that we have to update the website...
>
> So did you verify (with your NAND vendor) that this really is the named issue?

I have no idea if what the user reported is the unstable bits issue
but the fact you've never seen it occur in the wild tells me probably
not.

They are using a rather old kernel (4.4, but with a patch to gpmi-nand
backported from 4.7). I will set up a controlled test with random
power-cuts in a test fixture I have to see if I can get it to recur
on a) the old kernel and then b) the current kernel.
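
For the log check in such a test, a minimal sketch of what could run
against the captured kernel log after each power cycle is below. The
regular expression is modeled only on the ubi_io_read messages quoted
above; the function names find_ecc_errors and summarize are made up for
illustration, and how the fixture captures the log is left open.

```python
import re
from collections import Counter

# Pattern modeled on the ubi_io_read messages quoted earlier in this thread.
# It assumes each message sits on one unwrapped line, as in raw dmesg output.
UBI_ECC_RE = re.compile(
    r"ubi(?P<dev>\d+) (?P<level>warning|error): ubi_io_read: "
    r"error -74 \(ECC error\) while reading (?P<length>\d+) bytes "
    r"from PEB (?P<peb>\d+):(?P<offset>\d+)"
)


def find_ecc_errors(log_text):
    """Return one dict per ubi_io_read ECC message found in log_text."""
    hits = []
    for m in UBI_ECC_RE.finditer(log_text):
        hits.append({
            "dev": int(m["dev"]),
            "level": m["level"],      # "warning" (retry) or "error" (gave up)
            "peb": int(m["peb"]),
            "offset": int(m["offset"]),
            "length": int(m["length"]),
        })
    return hits


def summarize(hits):
    """Count messages per (device, PEB) so retries on one block collapse."""
    return Counter((h["dev"], h["peb"]) for h in hits)
```

Fed the five log lines above, this should report four messages for PEB
2807 on ubi0 (three retries plus the final error). As an aside, the
offset and length in those messages add up to 8192 + 253952 = 262144
bytes (256 KiB), which would be consistent with a read of everything
after the first 8192 bytes of a 256 KiB eraseblock, though that block
size is only my inference from the numbers.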

Thanks for the feedback!

Tim