PL353 NAND Controller - SW vs HW ECC
Andrea Scian
andrea.scian at dave.eu
Mon Feb 9 03:37:37 PST 2026
Dear all,
I hope I'm not annoying you by putting you directly in CC, but you are the people who were already involved in my patch to fix SW ECC support in the PL353 NAND controller (mainly used in Xilinx/AMD Zynq7k SoCs), and I think you are the ones who might help me with this follow-up.
Our standard HW/SW validation procedure for BSPs includes (after some basic functional tests) raw NAND MTD tests.
Usually we check ECC functionality with mtd_nandbiterrs, but its way of testing ECC correction is quite obscure and unmaintained (see the thread between Miquel and me on this mailing list in December 2025 on this topic).
We've thus moved to the userspace tool nandflipbits, which gives much more control over bitflip generation, making it easier to understand whether everything is fine or not.
By using this tool, I'm able to reproduce what I think is a PL353 HW ECC malfunction. I think it is hardware related (there are some errata on this, cryptic IMHO), but I may be missing something and it may be "just" a software bug.
There's also the obvious 3rd option: PEBKAC. I may be doing something wrong with my test setup, either in the kernel/test configuration/usage or in the HW setup ;-)
Step 1 - SW ECC
Thanks to my patch (and the mailing list review), I can now use SW Hamming ECC on Zynq7k-based devices. So this test uses software Hamming ECC (1 bit per 256 bytes).
This is the device tree:
&nfc0 {
    status = "okay";
    nand@0 {
        reg = <0x0>;
        #address-cells = <0x1>;
        #size-cells = <0x1>;
        nand-ecc-mode = "soft";
        nand-ecc-algo = "hamming";
        nand-ecc-strength = <1>;
        nand-ecc-step-size = <256>;
        nand-on-flash-bbt;
        nand-bus-width = <8>;
        status = "okay";
        partition@nand-ubi {
            label = "ubi";
            reg = <0x00000000 0x0>;
        };
    };
};
To make it quick, I'm using just the first erase block (EB), with a simple string in it (in my case this is also useful for testing in U-Boot, but that is for a separate thread ;-) ):
root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
Erasing 128 Kibyte @ 0 -- 100 % complete
root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
Writing data to block 0 at offset 0x0
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |just testing....|
I'm now inserting one bitflip, which is detected and corrected as expected:
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0 | head -n 1
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
0x00000000: 6a 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtst testing....|
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 corrected bitflip(s) at offset 0x00000000
0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |just testing....|
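As a cross-check of the raw dump above, here is a minimal Python sketch of what a single flip does to the payload. It assumes the nandflipbits argument has the form <bit>@<byte offset>, which is what the dumps suggest; `flip_bit` is a hypothetical helper, not part of mtd-utils.

```python
def flip_bit(data: bytes, bit: int, offset: int) -> bytes:
    """Return a copy of data with one bit toggled (assumed bit@offset semantics)."""
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in 0..7")
    out = bytearray(data)
    out[offset] ^= 1 << bit
    return bytes(out)


payload = b"just testing\n"
after_first = flip_bit(payload, 0, 1)        # bit 0 of byte 1: 'u' (0x75) -> 't' (0x74)
after_second = flip_bit(after_first, 0, 2)   # bit 0 of byte 2: 's' (0x73) -> 'r' (0x72)

print(after_first[:4].hex(" "))    # 6a 74 73 74
print(after_second[:4].hex(" "))   # 6a 74 72 74
```

The first result matches the raw (-n) dump after one flip; the second is what a raw dump should show once bit 0 of byte 2 is flipped as well.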
With an additional bitflip, we have an uncorrectable error (and this is, again, expected):
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@2
root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0 | head -n 1
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtrt testing....|
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
ECC failed: 0
ECC corrected: 1
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 uncorrectable bitflip(s) at offset 0x00000000
0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtrt testing....|
The same applies to another combination of bitflips (this will be useful later; don't look at the ECC counters, I had to reboot the system ;-) ):
root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
Erasing 128 Kibyte @ 0 -- 100 % complete
root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
Writing data to block 0 at offset 0x0
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@0
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 uncorrectable bitflip(s) at offset 0x00000000
0x00000000: 6b 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |ktst testing....|
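For reference, the SW ECC behaviour above is what a textbook 1-bit Hamming code with paired parities should give: one flip is located and corrected, two flips in the same step are always flagged as uncorrectable. The Python sketch below is a generic model only. Its bit ordering and ECC layout are assumptions and do not match the kernel's SW Hamming implementation or the PL353 format, and it makes no claim about why the PL353 behaves differently.

```python
from typing import Optional, Tuple

STEP_BYTES = 256
ADDR_BITS = 11                 # 256 bytes * 8 = 2048 = 2**11 bit positions
MASK = (1 << ADDR_BITS) - 1


def hamming_ecc(data: bytes) -> Tuple[int, int]:
    """Return (odd, even) parity words, one parity bit per address bit."""
    odd = even = 0
    for pos in range(len(data) * 8):
        if (data[pos >> 3] >> (pos & 7)) & 1:
            odd ^= pos             # parity k covers positions whose address bit k is 1
            even ^= ~pos & MASK    # parity k covers positions whose address bit k is 0
    return odd, even


def classify(stored: Tuple[int, int], data: bytes) -> Tuple[str, Optional[int]]:
    """Compare the stored ECC with the ECC recomputed from the data read back."""
    odd, even = hamming_ecc(data)
    s_odd, s_even = stored[0] ^ odd, stored[1] ^ even
    if s_odd == 0 and s_even == 0:
        return "clean", None
    if s_odd ^ s_even == MASK:
        return "corrected", s_odd      # syndrome is the flipped bit position
    if bin(s_odd).count("1") + bin(s_even).count("1") == 1:
        return "corrected", None       # single flip inside the ECC itself
    return "uncorrectable", None


def flip(data: bytes, bit: int, offset: int) -> bytes:
    """Toggle one bit of one byte (hypothetical helper for this sketch)."""
    out = bytearray(data)
    out[offset] ^= 1 << bit
    return bytes(out)


step = b"just testing\n".ljust(STEP_BYTES, b"\xff")
ecc = hamming_ecc(step)
print(classify(ecc, flip(step, 0, 1)))               # ('corrected', 8)
print(classify(ecc, flip(flip(step, 0, 1), 0, 0)))   # ('uncorrectable', None)
```

In this model a double flip at positions a and b gives identical odd and even syndromes (both a XOR b), so it can never satisfy the single-error condition. A decoder that looked only at the odd syndrome would instead "correct" bit a XOR b and return wrong data.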
Step 2 - PL353 HW ECC
The device tree is now:
&nfc0 {
    status = "okay";
    nand@0 {
        reg = <0x0>;
        #address-cells = <0x1>;
        #size-cells = <0x1>;
        nand-ecc-mode = "hw";
        nand-ecc-strength = <1>;
        nand-ecc-step-size = <256>;
        nand-on-flash-bbt;
        nand-bus-width = <8>;
        status = "okay";
        partition@nand-ubi {
            label = "ubi";
            reg = <0x00000000 0x0>;
        };
    };
};
Please note that the PL353 driver does not honour the nand-ecc-step-size property (the step size stays at 512 despite the 256 requested above), but this is a secondary issue (this NAND device requires 1 bit per 512 bytes, so it's fine anyway):
root@sw0005-devel:/lib/modules# cat /sys/class/mtd/mtd0/ecc_step_size
512
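The step-size difference also does not change which ECC step the test touches. A small Python sketch, assuming the 2048-byte page size reported by nanddump and that ECC steps are laid out consecutively within a page (the helper names are hypothetical):

```python
PAGE_SIZE = 2048  # from the nanddump banner: "page size 2048"


def steps_per_page(step_size: int) -> int:
    """Number of independently protected ECC steps in one page."""
    return PAGE_SIZE // step_size


def ecc_step_index(byte_offset: int, step_size: int) -> int:
    """Index of the ECC step (within its page) that covers byte_offset."""
    return (byte_offset % PAGE_SIZE) // step_size


for step_size in (256, 512):
    touched = {ecc_step_index(off, step_size) for off in (0, 1, 2)}
    print(step_size, steps_per_page(step_size), sorted(touched))
```

With either step size, byte offsets 0, 1 and 2 used in these tests fall in step 0, so every pair of flips injected here lands in a single ECC step.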
Re-doing the same test as above:
root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
Erasing 128 Kibyte @ 0 -- 100 % complete
root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
Writing data to block 0 at offset 0x0
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |just testing....|
One single bitflip is detected and corrected as expected:
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0 | head -n 1
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
0x00000000: 6a 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtst testing....|
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 corrected bitflip(s) at offset 0x00000000
0x00000000: 6a 75 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |just testing....|
A 2nd bitflip is detected as uncorrectable, as expected:
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@2
root@sw0005-devel:~# nanddump -n -c -s 0 --length=100 /dev/mtd0 | head -n 1
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtrt testing....|
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
ECC failed: 0
ECC corrected: 1
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 uncorrectable bitflip(s) at offset 0x00000000
0x00000000: 6a 74 72 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jtrt testing....|
But there are some corner cases, e.g. double bitflips that are (wrongly) detected as a single bitflip, so that wrong data is returned:
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@2
root@sw0005-devel:~# nandflipbits /dev/mtd0 1@1
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
ECC failed: 1
ECC corrected: 2
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 corrected bitflip(s) at offset 0x00000000
0x00000000: 6a 76 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jvst testing....|
Another full test from scratch (the ECC corrected counter is bigger than expected because I had to try a few combinations without rebooting ;-) ):
root@sw0005-devel:~# flash_erase /dev/mtd0 0 1
Erasing 128 Kibyte @ 0 -- 100 % complete
root@sw0005-devel:~# echo just testing | nandwrite -p /dev/mtd0
Writing data to block 0 at offset 0x0
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@1
root@sw0005-devel:~# nandflipbits /dev/mtd0 0@0
root@sw0005-devel:~# nanddump -c -s 0 --length=100 /dev/mtd0 | head -n 1
ECC failed: 1
ECC corrected: 6
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000064...
ECC: 1 corrected bitflip(s) at offset 0x00000000
0x00000000: 6b 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |ktst testing....|
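To quantify how far the two "corrected" HW ECC reads above are from the written payload, here is a minimal Python sketch that parses a nanddump -c line and lists the differing bits. The helper names are hypothetical, and the expected buffer assumes the 0xff padding visible in the clean dump.

```python
EXPECTED = b"just testing\n" + b"\xff" * 3   # first 16 bytes after nandwrite -p


def parse_dump_line(line: str) -> bytes:
    """Extract the data bytes from one `nanddump -c` output line."""
    hex_part = line.split(":", 1)[1].split("|", 1)[0]
    return bytes.fromhex(hex_part)


def bit_diffs(expected: bytes, actual: bytes):
    """List (byte offset, bit) pairs where the two buffers differ."""
    return [(i, bit)
            for i, (e, a) in enumerate(zip(expected, actual))
            for bit in range(8)
            if ((e ^ a) >> bit) & 1]


first = "0x00000000: 6a 76 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |jvst testing....|"
second = "0x00000000: 6b 74 73 74 20 74 65 73 74 69 6e 67 0a ff ff ff |ktst testing....|"

print(bit_diffs(EXPECTED, parse_dump_line(first)))    # [(1, 0), (1, 1)]
print(bit_diffs(EXPECTED, parse_dump_line(second)))   # [(0, 0), (1, 0)]
```

In both cases the read is reported as "1 corrected bitflip(s)", yet the first 16 bytes still differ from the payload by two bits.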
Conclusions:
IIUC the results of the above tests, we have an issue with the PL353 because it cannot detect double-bit errors (at least some combinations of them). While this is a rare event on SLC NAND devices (which require 1-bit ECC to guarantee 100k P/E cycles), I think it might cause catastrophic failures in the field because, AFAIK, upper MTD layers such as UBI don't expect this situation.
Am I wrong?
I kindly ask the MTD experts whether I have to worry about this, or whether we can assume that correcting 1-bit errors is enough for this subsystem.
If this is not acceptable, I think we have to update the driver, at least to warn the user about this and to use SW ECC where possible. WDYT?
If anybody on this list can also help me understand whether I'm doing something wrong with my test, or whether I have some setup/configuration error, it would be appreciated, and I can run some additional tests.
In the meantime, as my previous patch anticipated, I'm trying to set up a configuration using only SW ECC (I'm currently stuck on Linux/U-Boot compatibility, but this might not be the right ML to discuss that topic).
Any feedback is appreciated. Kind Regards,
Andrea SCIAN
SW Development Manager
DAVE Embedded Systems