Atmel Nand PMECC UBI ECC issue

Mon Mar 26 13:07:49 PDT 2018

Olivier,

On Monday, 26 March 2018, 16:56:17 CEST, Olivier Schonken wrote:
> Sorry for the resend, seems my gmail editor was in HTML mode which got
> rejected by the mailing list.  Humble apologies.
> 
> I have run into an issue with the Atmel nand controller on the
> SAMA5D36, which I am struggling to debug.
> 
> We are using custom hardware based on the SAMA5D36. With Micron
> MT29F8G08ABBCAH4 NAND flash.  Kernel version is 4.14.29 - mainline
> from kernel.org.  ECC strength is 24 bits with 1024 byte sector size.
> The PMECC settings were calculated as per
> https://www.at91.com/linux4sam/bin/view/Linux4SAM/PmeccConfigure, with
> the nand HEADER value at 0xc0e18e05.
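
As a sanity check on that header value, here is a rough sketch (Python)
that splits it into its fields. The bit positions and lookup tables are
an assumption based on the NAND header description in the SAMA5D3 boot
ROM documentation, and decode_pmecc_header() and ecc_bytes_per_sector()
are names made up for this mail, so please double-check against the
datasheet.

```python
# Assumed layout of the 32-bit PMECC NAND header (boot ROM / SAM-BA):
#   [0] usePmecc, [3:1] nbSectorPerPage, [12:4] spareSize,
#   [15:13] eccBitReq, [17:16] sectorSize, [26:18] eccOffset, [31:28] key
SECTORS_PER_PAGE = {0: 1, 1: 2, 2: 4, 3: 8}
ECC_BITS = {0: 2, 1: 4, 2: 8, 3: 12, 4: 24}
SECTOR_SIZE = {0: 512, 1: 1024}


def decode_pmecc_header(value):
    """Split the header word into named fields (hypothetical helper)."""
    return {
        "use_pmecc": bool(value & 0x1),
        "sectors_per_page": SECTORS_PER_PAGE[(value >> 1) & 0x7],
        "spare_size": (value >> 4) & 0x1FF,
        "ecc_bits": ECC_BITS[(value >> 13) & 0x7],
        "sector_size": SECTOR_SIZE[(value >> 16) & 0x3],
        "ecc_offset": (value >> 18) & 0x1FF,
        "key": (value >> 28) & 0xF,
    }


def ecc_bytes_per_sector(ecc_bits, sector_size):
    """BCH parity bytes per sector: ceil(m * t / 8), with m = 13 for
    512-byte sectors and m = 14 for 1024-byte sectors (assumed)."""
    m = 13 if sector_size == 512 else 14
    return (m * ecc_bits + 7) // 8


hdr = decode_pmecc_header(0xC0E18E05)
ecc_total = hdr["sectors_per_page"] * ecc_bytes_per_sector(
    hdr["ecc_bits"], hdr["sector_size"])
print(hdr)
print("ECC bytes per page:", ecc_total)
print("ECC end in OOB:", hdr["ecc_offset"] + ecc_total)
```

Under those assumptions 0xc0e18e05 decodes to 4 sectors per page, 224
bytes of spare area, 24 bits per 1024-byte sector and an ECC offset of
56. That gives 4 * 42 = 168 ECC bytes ending exactly at byte 224, which
matches the OOB size in your mtdinfo output, so the header at least
looks self-consistent.
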
> 
> The system works, and only some units show the error. The baffling
> part is that a unit can work properly for a long while and then
> suddenly the error appears. (I once traced it to a glibc library
> file, which means it isn't even due to heavy writing on the
> filesystem.) I have noticed that most of the time the error occurs in
> the same PEB, even after reprogramming the device via ubiformat or
> SAM-BA.
> 
> In the attached log output, you will see that there is a UBIFS error,
> where it detects a bitflip, which I confirmed by comparing the binary
> sequence to the Buildroot generated ubi file.
> 
> Reading back the contents of the NAND flash with Atmel's SAM-BA
> yields the correct contents for the page causing the ECC error.
> 
> 31 18 10 06 00 FE A2 74 FB CF 00 00 00 00 00 00 C5 05 00 00 01 00 00
> 00 AB 0C 00 00

At which offset is this?

> Starting up linux again results in the same issue.
> This extract shows the ubifs magic number with the bitflip. The rest
> of the binary sequence matches a unique part of the ubi image.
> 
> [   75.140000] 7fe0: b6f8f8e4 becf7a40 b6be5788 b6e9c000 60000010 ffffffff
> [   75.150000] UBIFS error (ubi0:0 pid 1): ubifs_check_node: bad magic 0x6101830, expected 0x6101831
> [   75.160000] UBIFS error (ubi0:0 pid 1): ubifs_check_node: bad node at LEB 325:216208
> [   75.160000] Not a node, first 24 bytes:
> [   75.160000] 00000000: 30 18 10 06 00 fe a2 74 fb cf 00 00 00 00 00 00 c5 05 00 00 01 00 00 00  0......t................
> [   75.180000] CPU: 0 PID: 1 Comm: systemd Not tainted 4.14.29+ #706
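
To pin down exactly which bits differ between the SAM-BA readback and
what the kernel saw, a small sketch (Python) that XORs the two byte
strings quoted above. diff_bits() is just a name I made up, and only
the first 24 bytes are compared because that is all the kernel log
shows.

```python
def diff_bits(expected, actual):
    """Return (byte_offset, bit_index) for every bit that differs.

    Bit index 0 is the least significant bit of the byte.
    """
    flips = []
    for offset, (e, a) in enumerate(zip(expected, actual)):
        delta = e ^ a
        for bit in range(8):
            if delta & (1 << bit):
                flips.append((offset, bit))
    return flips


# First 24 bytes as read back with SAM-BA (matches the UBI image).
samba = bytes.fromhex(
    "31 18 10 06 00 FE A2 74 FB CF 00 00 00 00 00 00"
    " C5 05 00 00 01 00 00 00")
# Same 24 bytes as dumped by ubifs_check_node under Linux.
linux = bytes.fromhex(
    "30 18 10 06 00 fe a2 74 fb cf 00 00 00 00 00 00"
    " c5 05 00 00 01 00 00 00")

print(diff_bits(samba, linux))  # [(0, 0)]
```

So within those 24 bytes it is a single bit: bit 0 of the first byte of
the node (0x31 read as 0x30). Running the same comparison over the whole
page from a nanddump would show whether that is the only flipped bit.
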
> 
> mtdinfo for the partition in question
> Type:                           nand
> Eraseblock size:                262144 bytes, 256.0 KiB
> Amount of eraseblocks:          2048 (536870912 bytes, 512.0 MiB)
> Minimum input/output unit size: 4096 bytes
> Sub-page size:                  4096 bytes
> OOB size:                       224 bytes
> Character device major/minor:   90:10
> Bad blocks are allowed:         true
> Device is writable:             true
> 
> Device tree entry:
>         nand_controller: nand-controller {
>                 status = "okay";
> 
>                 nand@3 {
>                         reg = <0x3 0x0 0x800000>;
>                         atmel,rb = <0>;
>                         nand-bus-width = <8>;
>                         nand-ecc-mode = "hw";
>                         nand-ecc-strength = <24>;
>                         nand-ecc-step-size = <1024>;
>                         nand-on-flash-bbt;
>                         label = "atmel_nand";
>                 };
>         };
> 
> Attached are the dmesg traces with the ECC issue, together with a
> nanddump of the block with the ECC error, including OOB contents,
> taken with "nanddump -f nandblock-withoob.ubi /dev/mtd5 -s 0x51c0000
> -o -l 262144 &> nanddump-cmdline-output.txt"

Can you please share the dump without OOB?
UBI does not use OOB, so we don't need it and can use offsets as seen by UBI 
and UBIFS as-is. :)
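
To make the offsets concrete, a rough sketch (Python) of how the
position reported by UBIFS maps into a PEB dump taken without OOB. It
assumes the default UBI layout for this geometry: with a sub-page size
equal to the 4096-byte page, the EC header sits in page 0, the VID
header in page 1, and LEB data starts at offset 8192. If your image was
built with a different VID header offset, DATA_OFFSET needs adjusting.
locate() is a hypothetical helper.

```python
PEB_SIZE = 262144      # eraseblock size from mtdinfo
PAGE_SIZE = 4096       # minimum I/O unit from mtdinfo
ECC_STEP = 1024        # nand-ecc-step-size from the device tree
DATA_OFFSET = 2 * PAGE_SIZE  # assumption: EC hdr page + VID hdr page


def locate(leb_offset):
    """Map an offset inside a LEB to PEB, page and ECC sector offsets."""
    peb_offset = DATA_OFFSET + leb_offset
    page, in_page = divmod(peb_offset, PAGE_SIZE)
    sector, in_sector = divmod(in_page, ECC_STEP)
    return {
        "peb_offset": peb_offset,   # also the offset in an OOB-less dump
        "page": page,
        "offset_in_page": in_page,
        "ecc_sector": sector,
        "offset_in_sector": in_sector,
    }


# Node position from "bad node at LEB 325:216208".
print(locate(216208))

# PEB number of the block dumped with "nanddump -s 0x51c0000".
print(0x51C0000 // PEB_SIZE)
```

Under that assumption the bad node starts at byte 224400 of the PEB,
that is page 54, ECC sector 3 of that page, 144 bytes into the sector,
and the dumped block is PEB 327 of mtd5. With a dump without OOB the
first number can be used directly as the file offset.
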

Thanks,
//richard