master node can not be recovered

Wed Nov 4 00:40:49 PST 2015

Hi Artem,

On Wed, 04 Nov 2015 10:20:50 +0200
Artem Bityutskiy <dedekind1 at gmail.com> wrote:

> On Wed, 2015-11-04 at 09:03 +0100, Richard Weinberger wrote:
> > But if two or more pages are corrupted UBIFS will give up as this
> > must not happen
> > from UBIFS's point of view.
> 
> Right, and I hear that a lot of bug reports and frustration come from
> this. This worked with SLCs we were using when implementing UBIFS
> (particularly, Samsung OneNAND was used, it was very high-quality
> NAND). Nowadays, this needs to be changed.
> 
> UBIFS logic is this. If there is a corruption, then it must be in the
> last used NAND page. Pages after this NAND page must contain empty
> space.
> 
> A small complication, which is not important now, is that UBIFS may
> operate with multiple NAND pages, depending on what the driver reports
> as the min. I/O size.
> 
> Now the logic behind this was that we always write data from the
> beginning of the LEB, and continue to its end. In case of a power cut,
> we can only get corruption in the last NAND page (or more strictly,
> min. I/O unit) where we were writing to. The next NAND page and all the
> NAND pages after it should be empty. The previous NAND page and all the
> NAND pages before it should contain valid data (CRC OK).
> 
> Pretty simple. Worked well.
> 
> So what has to be changed in this logic? The definition of empty
> space should be changed, it seems, because obviously not every
> driver wants to/can ECC-protect the empty space.

There's a series [1] trying to address this problem by fixing bitflips
in erased pages, and hopefully all NAND controller drivers will be fixed
at some point (either by using the generic helper or by implementing
similar logic to detect bitflips in erased pages).
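
To illustrate the idea of detecting bitflips in erased pages, here is a
minimal userspace sketch in Python. It is not the code from the series:
the function name and the max_bitflips threshold are hypothetical, and
a real driver would derive the threshold from the ECC strength. The
assumption is that an erased page reads as all 0xFF, so every zero bit
counts as a bitflip.

```python
def is_erased_with_bitflips(page: bytes, max_bitflips: int) -> bool:
    """Return True if `page` is all 0xFF except for at most
    `max_bitflips` zero bits (hypothetical threshold)."""
    zero_bits = 0
    for byte in page:
        # An erased byte is 0xFF, so each 0 bit is a flipped bit.
        zero_bits += 8 - bin(byte).count("1")
        if zero_bits > max_bitflips:
            return False
    return True


# A 16-byte "page" with a single flipped bit in the first byte.
page = bytes([0xFE]) + b"\xff" * 15
print(is_erased_with_bitflips(page, max_bitflips=1))
```

With such a check, a page with a few flipped bits can still be treated
as empty space instead of being reported as corrupted.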

> 
> What else?

Well, I don't know if this is the subject here, but for MLC NANDs,
because of paired pages, the corruption can land in a page located
between two pages that have already been correctly written.
The question is, what should we do in this case? Should we drop all the
pages following the corrupted page in the LEB? Should we only drop the
faulty page and parse the nodes in the valid pages we can find after
this corruption (I don't know if this can happen, but if some nodes
depend on other nodes, doing that may not work)? Any other option?
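
To make the two options concrete, here is a toy Python model, not UBIFS
code. It assumes each page (min. I/O unit) of a LEB has already been
classified as valid (CRC OK), corrupt, or empty; all names are
hypothetical. strict_rule_ok() encodes the current assumption (valid
pages, at most one corrupted page, then only empty space), and the two
other functions return the page indexes each option would keep.

```python
VALID, CORRUPT, EMPTY = "valid", "corrupt", "empty"


def strict_rule_ok(pages):
    """Current assumption: valid pages, then at most one corrupt
    page, then nothing but empty pages."""
    i = 0
    while i < len(pages) and pages[i] == VALID:
        i += 1
    if i < len(pages) and pages[i] == CORRUPT:
        i += 1
    return all(state == EMPTY for state in pages[i:])


def drop_tail(pages):
    """Option 1: keep valid pages up to the first corrupt page and
    drop everything after it."""
    kept = []
    for i, state in enumerate(pages):
        if state == CORRUPT:
            break
        if state == VALID:
            kept.append(i)
    return kept


def skip_corrupt(pages):
    """Option 2: drop only the corrupt pages and keep every valid
    page, including those found after the corruption."""
    return [i for i, state in enumerate(pages) if state == VALID]


# Paired-page case: page 1 is corrupted although pages 2 and 3 are fine.
leb = [VALID, CORRUPT, VALID, VALID, EMPTY]
print(strict_rule_ok(leb))  # False
print(drop_tail(leb))       # [0]
print(skip_corrupt(leb))    # [0, 2, 3]
```

The model says nothing about whether the nodes in pages 2 and 3 are
still usable without the nodes lost in page 1; that dependency question
remains open.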

Best Regards,

Boris

[1] http://thread.gmane.org/gmane.linux.drivers.mtd/61358

-- 
Boris Brezillon, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com