Problem with UBI / UBIFS (mainly uncorrectable error) on kernel higher than 2.6.30.10

Mon May 21 08:57:04 EDT 2012

Hi Artem,

Thanks for the quick response.

1. Please find attached the log that you asked for. This time the mounting
process started to run the torture test for PEB 2062 in an infinite loop.
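
For context, a torture pass can be modelled in userspace roughly as follows:
erase the block, confirm it reads back as all 0xFF, write a pattern, read it
back and compare. This is only a sketch under stated assumptions: the pattern
values (0xA5, 0x5A, 0x00) are recalled from the UBI sources and should be
checked against the tree in use, the "eraseblock" is a small RAM buffer, and
every name (mock_peb, good_read, faulty_read, mock_torture) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_PEB_SIZE 64	/* tiny on purpose; real PEBs are far larger */

/* Hypothetical in-RAM eraseblock with a pluggable read path. */
struct mock_peb {
	uint8_t cells[MOCK_PEB_SIZE];
	void (*read)(const struct mock_peb *p, uint8_t *buf);
};

/* Read path that returns exactly what is stored. */
static void good_read(const struct mock_peb *p, uint8_t *buf)
{
	memcpy(buf, p->cells, MOCK_PEB_SIZE);
}

/* Read path that flips one bit, standing in for a faulty driver. */
static void faulty_read(const struct mock_peb *p, uint8_t *buf)
{
	memcpy(buf, p->cells, MOCK_PEB_SIZE);
	buf[5] ^= 0x10;
}

/* Returns 1 if every byte of buf equals patt, 0 otherwise. */
static int all_bytes_are(const uint8_t *buf, uint8_t patt)
{
	for (size_t i = 0; i < MOCK_PEB_SIZE; i++)
		if (buf[i] != patt)
			return 0;
	return 1;
}

/* Returns 0 if the block survives every pattern, -1 otherwise. */
static int mock_torture(struct mock_peb *p)
{
	static const uint8_t patterns[] = { 0xA5, 0x5A, 0x00 };
	uint8_t buf[MOCK_PEB_SIZE];

	for (size_t i = 0; i < sizeof(patterns); i++) {
		memset(p->cells, 0xFF, MOCK_PEB_SIZE);		/* "erase" */
		p->read(p, buf);
		if (!all_bytes_are(buf, 0xFF))
			return -1;
		memset(p->cells, patterns[i], MOCK_PEB_SIZE);	/* "write" */
		p->read(p, buf);
		if (!all_bytes_are(buf, patterns[i]))
			return -1;
	}
	return 0;
}
```

With good_read the block passes; with faulty_read it fails on the first
read-back even though the stored data is intact, so a block can be condemned
by the read path alone.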

2. I have compared the MTD sources between 2.6.30.10 (last good kernel)
and 2.6.31.1 (first with problems) and I found where the problem is.

The new version of the kernel (2.6.31.1) introduced a new buffer (in the
orion_nand.c file in the drivers/mtd/nand directory you will find a new
function, orion_nand_read_buf). What I did was comment out this line:

//nc->read_buf = orion_nand_read_buf;

in kernel 2.6.31.1 and recompiled it. This helped, and no errors occurred
this time during boot. I have tested 7 different devices and all of them
are now OK. What is more, I did the same with kernel 2.6.38.4 (our target
kernel) and again no errors occurred during boot and everything
seems to work just fine.
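
The hook mechanism behind that one-line change can be modelled in userspace.
The sketch below rests on two assumptions: that the NAND core installs a
generic byte-wise reader when a driver leaves read_buf unset (which would
explain why the driver still works with the assignment commented out), and
that an optimized reader has the usual shape of byte reads until the
destination is 8-byte aligned, wide reads for the bulk, and byte reads for
the tail. All mock_* names, generic_read_buf and fast_read_buf are
hypothetical; this is not the kernel code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the NAND data register: every access
 * returns the next byte of the page, as a FIFO would. */
struct mock_nand {
	const uint8_t *page;
	size_t pos;
};

static uint8_t mock_readb(struct mock_nand *n)
{
	return n->page[n->pos++];
}

/* Hypothetical, trimmed-down chip structure with a read hook. */
struct mock_chip {
	struct mock_nand *io;
	void (*read_buf)(struct mock_chip *c, uint8_t *buf, size_t len);
};

/* Generic path: one byte per access. */
static void generic_read_buf(struct mock_chip *c, uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		buf[i] = mock_readb(c->io);
}

/* "Fast" path: bytes until buf is 8-byte aligned, then 8 bytes per
 * step, then the remaining tail bytes. */
static void fast_read_buf(struct mock_chip *c, uint8_t *buf, size_t len)
{
	while (len && ((uintptr_t)buf & 7)) {
		*buf++ = mock_readb(c->io);
		len--;
	}

	size_t chunks = len / 8;
	for (size_t i = 0; i < chunks; i++) {
		uint8_t tmp[8];	/* models one 64-bit access */
		for (int j = 0; j < 8; j++)
			tmp[j] = mock_readb(c->io);
		memcpy(buf + 8 * i, tmp, sizeof(tmp));
	}

	for (size_t i = chunks * 8; i < len; i++)
		buf[i] = mock_readb(c->io);
}

/* Models the core filling in a default for a hook the driver left unset. */
static void mock_set_defaults(struct mock_chip *c)
{
	if (!c->read_buf)
		c->read_buf = generic_read_buf;
}
```

Both paths must deliver the same bytes in the same order. If the wide-read
step of a real driver drops or reorders an access, the data handed to the ECC
check is wrong, which would plausibly surface as uncorrectable errors.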

So my question is: what is the buffer responsible for, and is it safe for
us to just remove it like I did?

Like you said, it is not a UBI/UBIFS error; something is wrong with the
MTD driver (orion_nand, to be exact).
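
The return codes in the quoted log below fit that reading. On Linux, errno 74
is EBADMSG and 117 is EUCLEAN; the small lookup here is only a reading aid.
The one-line hints are a summary of how these codes are commonly used at the
MTD and UBIFS levels, not text from the kernel, and the names mtd_err and
decode_mtd_err are hypothetical.

```c
#include <stddef.h>

/* Linux errno values seen in the log, negated as the kernel returns them. */
struct mtd_err {
	int code;
	const char *name;
	const char *hint;	/* summary only, not kernel text */
};

static const struct mtd_err mtd_errs[] = {
	{ -74,  "EBADMSG", "read failed: ECC could not correct the data" },
	{ -117, "EUCLEAN", "UBIFS: node failed validation (here, bad CRC)" },
};

/* Returns the table entry for code, or NULL if it is not listed. */
static const struct mtd_err *decode_mtd_err(int code)
{
	for (size_t i = 0; i < sizeof(mtd_errs) / sizeof(mtd_errs[0]); i++)
		if (mtd_errs[i].code == code)
			return &mtd_errs[i];
	return NULL;
}
```

In the log, -74 comes from ubi_io_read (the flash read itself) and -117 from
do_readpage (UBIFS rejecting what it was given), so the failure starts below
UBIFS.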

Łukasz

On Fri, 2012-05-18 at 16:35 +0300, Artem Bityutskiy wrote:
> On Thu, 2012-05-17 at 13:45 +0200, Lukasz Nowak wrote:
> > 1. When using kernels: 2.6.30.1, 2.6.30.9, 2.6.30.10 the procedure of
> > attaching and mounting UBI device is OK and we are able to use it as our
> > rootfs.
> 
> OK.
> 
> > 2. When switching to kernel 2.6.31.1 and any higher (2.6.38.4 was the
> > highest used in the test) we are observing a lot of errors during the
> > attach/mount process:
> 
> OK, it gives a possibility to bisect and find the offending commit at
> least.
> 
> > UBI error: ubi_io_read: error -74 while reading 3281 bytes from PEB
> > 1898:96200,s
> > UBIFS error (pid 1): try_read_node: cannot read node type 1 from LEB
> > 70:94152, 4
> > uncorrectable error : 
> > UBI error: ubi_io_read: error -74 while reading 3281 bytes from PEB
> > 1898:96200,s
> > UBIFS error (pid 1): ubifs_check_node: bad CRC: calculated 0x743bfaf8,
> > read 0x70
> > UBIFS error (pid 1): ubifs_check_node: bad node at LEB 70:94152
> > UBIFS error (pid 1): ubifs_read_node: expected node type 1
> > UBIFS error (pid 1): do_readpage: cannot read page 257 of inode 2046,
> > error -117
> > uncorrectable error : 
> > UBI error: ubi_io_read: error -74 while reading 3281 bytes from PEB
> > 1898:96200,s
> > UBIFS error (pid 1): try_read_node: cannot read node type 1 from LEB
> > 70:94152, 4
> > uncorrectable error : 
> > UBI error: ubi_io_read: error -74 while reading 3281 bytes from PEB
> > 1898:96200,s
> > UBIFS error (pid 1): ubifs_check_node: bad CRC: calculated 0x743bfaf8,
> > read 0x70
> > UBIFS error (pid 1): ubifs_check_node: bad node at LEB 70:94152
> > UBIFS error (pid 1): ubifs_read_node: expected node type 1
> > UBIFS error (pid 1): do_readpage: cannot read page 257 of inode 2046,
> > error -117
> 
> I really doubt it is a UBIFS change that causes this issue. Maybe
> something changed at the MTD level?
> 
> Did you run MTD tests to validate your driver?
> 
> Do you normally do power cuts, or do you always shut down the board
> gracefully and 'sync' before shutting it down?
> 
> > Sometimes we see also errors "UBI: scrubbed PEB 1873 (LEB 0:1752), data
> > moved to PEB 1608", but the system boots and we can use it, but we are
> > not sure how long it will keep such good condition.
> 
> This message is OK - it is just FYI that UBI detected a bit-flip (which
> is normal) and it moves the contents of eraseblock 1873 to eraseblock
> 1608 in order to clean-up the bit-flip.
> 
> But if you see too many of these - it is not so normal.
> 
> > There were
> > situations where we upgraded the firmware (rootfs on mtd4 partition) and
> > after that the motherboards were not able to boot up anymore (UBI mount
> > failed with errors similar to the ones above)
> 
> Well, there are too many unknowns to tell anything.
> 
> > What is strange is that the errors don't come all the time. Some of the
> > motherboards boot with the same configuration and some of them give us
> > errors like the ones above. But the most important thing here is that kernels
> > lower than 2.6.31.1 always work, so my conclusion is that there is some
> > bug in the MTD support in kernels higher than 2.6.30.10.
> 
> Maybe something changed, maybe it is just random luck. UBIFS tells you
> about ECC errors which may be caused by many things. Start by
> validating your drivers. Then start doing isolated UBIFS tests.
> 
> We maintain UBIFS back-port trees - try to pull the one corresponding to
> your version.
> 
> > 3. I am attaching some additional info about our configuration:
> > 
> > - attached full log from failed boot up process,
> > - attached full log from OK boot up process,
> > - used kernel configuration files,
> > - output from mtdinfo,
> > - the procedure of flashing the mtd device.
> > 
> > If you need something more, like debug logs, I can deliver it within a short
> > period of time. If you would like to get the motherboard for some
> > debugging or tests there will be no problem with this. Just ask.
> 
> First of all, remember to boot with "ignore_loglevel" option to see all
> messages, because your logs are incomplete (no debugging level
> messages). Send boot log produced this way.
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: log.tar.gz
Type: application/x-compressed-tar
Size: 360875 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-mtd/attachments/20120521/7d3d5cec/attachment-0001.bin>