problem with ecc errors and ubifs

Steffen Kühn sk at ammonit.com
Tue Oct 22 04:52:37 PDT 2013


Hey,

my question is about ECC errors and the way UBIFS deals with them.
Unfortunately, I have to go into some detail to make my problem clear.

We use MT29F4G08ABBDAHC flash devices from Micron. This type of flash
supports on-die ECC (8 bit) and needs at least 4-bit ECC. When we
started using this flash, the kernel (3.2) had no support for on-die
ECC, so we wrote the on-die ECC support ourselves.

In principle everything works very well. But we use our hardware in
large numbers and quite intensively. Over the months we have observed
numerous destroyed file systems with various UBI errors. To find the
cause of that problem, I wrote a mechanism (in U-Boot) that creates bit
errors.

With that I ran different tests. One test was to create just one single
bit error in the whole flash device. The on-die ECC mechanism (which can
correct up to 8 bit errors) had no problem correcting this error. Our
kernel code contains a piece of code that reports the occurrence of a
bit error to the layers above. With this information UBIFS can decide
whether and what it has to do.
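
To make clearer what I mean by "reporting", here is a minimal sketch.
It is not our driver code and not the kernel source; the struct and the
function names (ecc_stats_model, report_page_result, read_result) are
made up for illustration. It assumes what I understand the 3.2 NAND
layer to do: the driver counts corrected and failed pages, and the read
path returns -EUCLEAN when the "corrected" counter changed during the
read and -EBADMSG when the "failed" counter changed. The
report_corrected flag models the reporting being switched on or off.

```c
#include <assert.h>
#include <errno.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux value, only used if errno.h lacks it */
#endif

/* Hypothetical stand-in for the ECC counters kept per flash device. */
struct ecc_stats_model {
	unsigned int corrected; /* bit errors the ECC has corrected */
	unsigned int failed;    /* reads the ECC could not correct */
};

/*
 * What the (hypothetical) driver does after reading one page.
 * report_corrected == 0 models a driver that drops the information
 * about corrected bit errors instead of passing it upwards.
 */
void report_page_result(struct ecc_stats_model *stats, int corrected_bits,
			int uncorrectable, int report_corrected)
{
	assert(corrected_bits >= 0);

	if (uncorrectable)
		stats->failed++;
	else if (report_corrected)
		stats->corrected += (unsigned int)corrected_bits;
}

/* Return value of a read, derived from the counters before and after. */
int read_result(const struct ecc_stats_model *before,
		const struct ecc_stats_model *after)
{
	if (after->failed != before->failed)
		return -EBADMSG; /* data is not trustworthy */
	if (after->corrected != before->corrected)
		return -EUCLEAN; /* data is fine, but bit errors were fixed */
	return 0;
}
```

With reporting enabled, a single corrected bit error is already enough
to make the read return -EUCLEAN in this model; with reporting disabled
the same read returns 0 and the layers above never learn about it.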

I have seen that such error reporting usually leads to a page "scrub". I
do not really understand what happens there, but sometimes the result is
catastrophic. Because of that I removed the error reporting (my hope is
that 8 bit errors in one page occur rarely enough). After removing that
code our problems vanished completely (I have even tested with more than
8 bit errors in the same page => no problems). I could not provoke any
faults by creating numerous bit errors in dozens of pages.

What is your opinion? Have I overlooked something? I know that this
method has risks, but I hope that on balance the file system stays
alive longer.

Best
Steffen



