PL353 NAND Controller - SW vs HW ECC

Mon Feb 23 08:39:50 PST 2026

Dear Miquel,

(as usual, sorry for the late response)

On 12/02/2026 11:40, Miquel Raynal wrote:
> Hi Andrea,
> 
> On 10/02/2026 at 15:14:30 +01, Andrea Scian <andrea.scian at dave.eu> wrote:
> 
>> Dear Miquel,
>>
>>
>> On 10/02/2026 11:12, Miquel Raynal wrote:
>>> Hi Andrea,
>>> On 09/02/2026 at 11:37:37 GMT, Andrea Scian <andrea.scian at dave.eu>
>>> wrote:
>>>
>>>> Dear all,
>>>>
>>>> I hope I'm not annoying you by putting you directly in CC, but these people are the ones who were already involved in my patch to fix SW ECC support in the PL353 NAND controller (mainly used in the Xilinx/AMD Zynq7k SoC), and I think they are the ones who might help me with this follow-up.
>>>>
>>>> Our standard HW/SW validation procedure for BSPs includes (after some basic functional tests) raw NAND MTD tests.
>>>>
>>>> Usually we check ECC functionality with mtd_nandbiterrs but its way
>>>> of testing ECC correction is quite obscure and unmaintained (see a
>>>> thread between me and Miquel on this mailing list in December 2025 on
>>>> this topic).
>>> We stopped developing the kernel modules, for testing we advise to use
>>> the same tools from the mtd-utils test suite which are actively
>>> maintained.
>>> nandbiterrs -i is the correct tool for testing your ECC engine. It works
>>> this way (from memory, maybe not 100% accurate, but that's the idea):
>> [snip]
>>
>> Got it! This looks very similar to the kernel module and, in
>> fact, I got the same results, depending on the seed chosen.
> 
> Ah, finally. I was very suspicious about this observation in the first
> place. I remember you were reporting failures in the nandbiterrs -i test
> with seed=1, which means we must fall into one of the 90 cases that are
> not properly covered by the ECC engine?
> 
> [...]

Yes, this does not match the error cases listed in the errata
(if I understand that document correctly), but it is consistent with the
fact that it fails in some cases (which, to me, means that this ECC
engine is unusable anyway).

[...]

>> Refer to the r2p1 IP revision, which is affected by errata ID 721059.
>> Its statement no. 3 says
>>
>> "Some double error cases are not correctly identified as uncorrectable fail"
>>
>> "90 double errors out of the 8485140 possible double error combinations
>> are not correctly identified as
>> uncorrectable fail"
>>
>> "All double errors in the data (8386560 possible errors) will be
>> correctly identified as uncorrectable fail"
> 
> This is only one issue over 3.
> 
>> However, this is NOT my experience, as you can see from the above testing.
> 
> Maybe you fell into one of the two other cases?

I don't think so, which is why I ignored them in my previous email,
but here they are for the sake of completeness:

1) A single bit error in data byte 0, bit 0 will not be detected

2) A single bit error in second 12 bits of the 3 parity bytes read from
the spare area, will incorrectly be identified as having passed
the error check

So these cases are about single-bit errors, which were never an issue for
me ¯\_(ツ)_/¯
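
As a sanity check on the double-error figures quoted from the errata above, both totals are exactly the number of unordered bit pairs in one ECC step. The sketch below assumes a 512-byte step (4096 data bits) protected by 3 parity bytes (24 ECC bits); the variable names are mine, not from the errata.

```python
from math import comb

DATA_BITS = 512 * 8   # assumed: one 512-byte ECC step
ECC_BITS = 3 * 8      # assumed: the 3 parity bytes mentioned in the errata

all_pairs = comb(DATA_BITS + ECC_BITS, 2)   # any two bits of data + parity
data_pairs = comb(DATA_BITS, 2)             # both bits in the data area

print(all_pairs)               # 8485140, the errata's total
print(data_pairs)              # 8386560, the errata's data-only total
print(all_pairs - data_pairs)  # 98580 pairs touch at least one parity bit
```

Since the errata states that every data-only pair is detected, its 90 mis-detected pairs would all have to involve at least one bit of the parity bytes.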

> 
> [...]
> 
>>>> Please note that PL353 is not using nand-ecc-step-size property
>>>> correctly, but this is a secondary issue (this NAND device requires 1
>>>> bit on 512 byte, so it's fine anyway)
>>> Can you elaborate? Looking at the driver, it takes the ECC configuration
>>> from the core (hence, usually from the DT), otherwise it falls back to
>>> what the chip advertises in terms of requirements, and finally it falls
>>> back to 1b/512B as default.
>>
>> I tried with
>>
>> &nfc0 {
>> 	status = "okay";
>> 	nand@0 {
>> 		reg = <0x0>;
>> 		#address-cells = <0x1>;
>> 		#size-cells = <0x1>;
>>
>> 		nand-ecc-mode = "hw";
>> 		nand-ecc-strength = <1>;
>> 		nand-ecc-step-size = <256>;
>>
>> 		nand-on-flash-bbt;
>>
>> 		nand-bus-width = <8>;
>> 		status = "okay";
>>
>> 		partition@nand-ubi {
>> 			label = "ubi";
>> 			reg = <0x00000000 0x0>;
>> 		};
>> 	};
>> };
>>
>> But I still got
>>
>> root@sw0005-devel:~# cat /sys/class/mtd/mtd0/ecc_step_size
>> 512
>>
>>
>> Maybe I'm missing something, but it seems that even if PL353 gets this
>> from code/NAND requirements in pl35x_nand_attach_chip()
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/raw/pl35x-nand-controller.c#n948
>>
>> In case of (host) HW ECC this gets later overwritten when initializing
>> PL353 ECC controller in pl35x_nand_init_hw_ecc_controller()
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mtd/nand/raw/pl35x-nand-controller.c#n910
> 
> This is a bug, either you shall remove the '= 512' assignment (if the
> configuration of the ECC engine is already done correctly, there is
> nothing to do except removing this limit), or you should refuse any
> value that is not 512 if the engine cannot be configured for 256B steps,
> else you should add the logic to configure the ECC engine logic for
> steps != 512.

Thanks for the explanation.
I'll try to figure this out, but it's not my main objective.

>>>> I kindly ask the MTD experts whether I have to worry about this or whether
>>>> we can assume that correcting 1 bit error is enough for this subsystem.
>>> No, the expectation is a clear failure upon double bit errors. Be
>>> careful though, Hamming ECC engines carry *no guarantee* for 3 bit
>>> errors. Only 0, 1 and 2 are part of the scope, and 2 bit errors are
>>> uncorrectable, which means:
>>> - 0 bf, ok
>>> - 1 bf, ok + reporting 1 bf
>>> - 2 bf, NOK + reporting an error
>>> - more bf: no guarantee, usually returns incorrect data with a correct
>>> status
>>
>> Thanks for pointing this out. I was not aware of the last case when
>> using the Hamming algorithm.
>> Probably we'll have to move to BCH, even if, IIRC, it requires more
>> CPU horsepower to do the job.
> 
> It does, and you can observe the impact with a speed test, eg:
> 
>          flash_speed -dc10 /dev/mtdx
> 
> [...]

Another userspace tool that replaces mtd_speedtest.ko, thanks ;-)

>>> This is obviously just speculation, maybe the errata you mentioned above
>>> will bring an obvious hardware failure to our attention. The Arasan IP
>>> used on ZynqMP also suffers from a similar limitation (not able to
>>> correctly report failures) and I decided to implement one path using the
>>> software BCH engine, with a time penalty of course.
>>
>> So that's nearly the same result I'm trying to get with Zynq7k
>> platform ;-)
> 
> In this case, I can only recommend the blog post I wrote for the Arasan
> controller. I believe it should be easier to do as you won't need the
> polynomial reverse-engineering.
> 
> Link: https://bootlin.com/blog/supporting-a-misbehaving-nand-ecc-engine/

Thanks for this detailed article!
I think I'll move to software Hamming (if I can make it work in U-Boot)
or BCH (after evaluating the performance impact).
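
To make the 0/1/2/3-bitflip behaviour discussed earlier in this thread concrete, here is a toy single-error-correcting code over one 512-byte step. It is only a sketch under my own assumptions: the 24-bit code is the XOR of the positions of all set bits plus the XOR of their 12-bit complements, which follows the idea behind NAND Hamming ECC but is neither the kernel's byte layout nor what the PL353 engine computes. All function names are hypothetical.

```python
import random

STEP_BITS = 512 * 8   # data bits in one ECC step
MASK = 0xFFF          # 12 bits are enough to address 4096 bit positions


def calc_ecc(data):
    """24-bit toy code: XOR of set-bit positions and XOR of their complements."""
    p = p_inv = 0
    for pos in range(STEP_BITS):
        if (data[pos >> 3] >> (pos & 7)) & 1:
            p ^= pos
            p_inv ^= ~pos & MASK
    return (p << 12) | p_inv


def correct(data, stored_ecc):
    """Repair data in place; return corrected bitflips (0 or 1), -1 if uncorrectable."""
    syndrome = stored_ecc ^ calc_ecc(data)
    if syndrome == 0:
        return 0                          # no bitflip seen
    p, p_inv = syndrome >> 12, syndrome & MASK
    if (p ^ p_inv) == MASK:               # complementary halves: one data bit
        data[p >> 3] ^= 1 << (p & 7)
        return 1
    if bin(syndrome).count("1") == 1:     # one flipped bit inside the code itself
        return 1
    return -1                             # anything else is reported as a failure


def trial(nflips, seed=1):
    """Flip nflips distinct data bits, run correction, report (status, data intact)."""
    rng = random.Random(seed)
    good = bytes(rng.randrange(256) for _ in range(512))
    page = bytearray(good)
    for pos in rng.sample(range(STEP_BITS), nflips):
        page[pos >> 3] ^= 1 << (pos & 7)
    status = correct(page, calc_ecc(good))
    return status, bytes(page) == good


for n in range(4):
    print(n, trial(n))
```

In this model, 0 and 1 flips come back intact, 2 flips are reported as uncorrectable (-1), and 3 flips are reported as a single corrected bitflip while the data is still wrong, which is the no-guarantee case.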

I have some issues with Linux to U-Boot interoperability (single-bit
errors are not correctly handled in the U-Boot Hamming implementation).
Is this the right mailing list to ask for advice, or should I move to
the U-Boot ML?

Kind Regards,

Andrea Scian