problem with ecc errors and ubifs

Tue Oct 22 05:05:42 PDT 2013

On Tue, 22 Oct 2013, Steffen Kühn wrote:

> We use MT29F4G08ABBDAHC flash devices from Micron. This type of flash
> supports on-die-ecc (8 bit) and needs at least 4-bit-ecc. When we have
> started to use this flash no support for on die ecc was available in the
> kernel (kernel 3.2). The on die ecc support is - because of that - self
> written.
> ...
>
> I have seen that such error reporting leads usually to a page "scrub". I
> do not really understand what there happens.

I don't know the details, but scrubbing means that the data in the page
is rewritten to another place in the flash, since the presence of a bit
error suggests that the data in the page will eventually become
unreliable.
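
As a rough illustration of that idea, here is a toy model in C. It is not
the actual UBI code: the struct, the sizes and the function name are all
made up, and it assumes the caller passes a block that currently holds
live, already ECC-corrected data.

```c
#include <string.h>

#define TOY_BLOCKS     4   /* hypothetical: number of physical blocks */
#define TOY_BLOCK_SIZE 8   /* hypothetical: bytes per block */

struct toy_flash {
    unsigned char data[TOY_BLOCKS][TOY_BLOCK_SIZE];
    int used[TOY_BLOCKS];  /* 1 if the physical block holds live data */
};

/*
 * Move the contents of physical block 'from' (which must be in use) to
 * the first free block, then erase the old block. Returns the new
 * physical block number, or -1 if no free block exists, in which case
 * nothing is changed.
 */
int toy_scrub(struct toy_flash *f, int from)
{
    for (int to = 0; to < TOY_BLOCKS; to++) {
        if (f->used[to])
            continue;
        memcpy(f->data[to], f->data[from], TOY_BLOCK_SIZE);
        f->used[to] = 1;
        /* erased NAND reads back as all 0xff */
        memset(f->data[from], 0xff, TOY_BLOCK_SIZE);
        f->used[from] = 0;
        return to;
    }
    return -1;
}
```

A real implementation also has to update whatever maps logical to
physical blocks and cope with power cuts during the copy, which this
sketch ignores.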

The code as it looks today triggers scrubbing whenever a single-bit
correction is detected during a read. The reason is that the classic
Hamming algorithm can only correct one bad bit, so if a second bit flips
the data becomes unreadable.

I know there have been discussions about ECC that can correct more than
a single bit in a given area. The idea is not to trigger scrubbing as
soon as a single bit goes bad, but to use a threshold, so that scrubbing
is first triggered when, say, half the maximum number of correctable
bits need correcting (e.g. in your case when 4 bits need correcting).
The reason is that flashes which require multibit ECC tend to have bits
here and there that flip rather quickly after writing, so triggering on
them leads to needless scrubbing and hence extra wear on the flash.
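
A minimal sketch of such a threshold check, assuming the ECC layer
reports how many bits it corrected for a read. The enum, the function
name and the parameters are hypothetical and not taken from the kernel:

```c
/* Hypothetical outcome of one page read; not a kernel API. */
enum read_status { READ_OK, READ_NEEDS_SCRUB, READ_UNCORRECTABLE };

/*
 * corrected: number of bits the ECC had to correct, or a negative
 *            value if the ECC reported a failure
 * strength:  maximum number of bits the ECC can correct (8 here)
 * threshold: number of corrected bits at which scrubbing should start
 */
enum read_status classify_read(int corrected, int strength, int threshold)
{
    if (corrected < 0 || corrected > strength)
        return READ_UNCORRECTABLE;
    if (corrected >= threshold)
        return READ_NEEDS_SCRUB;
    return READ_OK;   /* data is fine, no scrub requested */
}
```

With a strength of 8 and a threshold of 4, a read with one corrected bit
is left alone and a read with four corrected bits requests scrubbing. A
threshold of 1 gives the scrub-on-any-correction behaviour described
above.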

> But sometimes the result is catastrophic. Because of that I have removed 
> the error reporting (my hope is that 8 bit errors occur seldom enough in 
> a page) After that code removing our problems are completely vanished (I 
> have even tested with more than 8 bit errors in the same page => no 
> problems). I could not provoke any faults by creating numerous bit 
> errors in dozens of pages.

Removing the 'corrected bit' reporting avoids scrubbing, so that code
path is never executed. It seems like a serious bug if the scrubbing
mechanism doesn't work as intended, for whatever reason.

> What is your opinion? Have I overlooked something? I know that this
> method has risks but I hope that under the line the file system stays
> longer alive.

The risk is that the number of errors eventually grows past the ECC
capability, leaving the data unreadable. If there is indeed a bug in the
scrubbing mechanism, I agree it would be better to just correct the bits
and hope not too many of them flip. But it would be better still to try
and fix the problem with scrubbing...

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30