NAND ECC errors

Miquel Raynal miquel.raynal at bootlin.com
Fri Aug 8 01:37:21 PDT 2025


On 07/08/2025 at 20:36:56 GMT, Chris Packham <Chris.Packham at alliedtelesis.co.nz> wrote:

> Hi Markus,
>
> On 08/08/2025 03:16, markus.stockhausen at gmx.de wrote:
>> Hi,
>>
>> Chris (CC) developed the drivers/spi/spi-realtek-rtl-snand.c for the
>> Realtek switch platform. Thanks for that and the inclusion into mainline.
>> While adding it to one of my devices I'm getting ECC errors.
>>
>> Situation is as follows.
>>
>> - Linksys LGS328 (with RTL9301 SOC and that NAND controller)
>> - OpenWrt with Kernel 6.12 longterm
>> - The Realtek SPI NAND driver (backported from current master)
>> - Macronix MX35LF1GE4AB (1GBit)
>> - Boot via TFTP
>>
>> I found a vendor UBI partition in NAND that I want to analyze.
>> It is active and the vendor firmware seems to work on it.
>> I assume it contains a filesystem with configuration and logs.
>> During ubiattach I get tons of errors "ubi0 warning: ubi_io_read:
>> Error -77  (ECC error) while reading 64 bytes from PEB 0:0, read
>> only 64 bytes, retry".
>>
>> Call stack shows:
>>
>> spinand_mtd_regular_page_read
>>    spinand_read_page
>>      spinand_load_page_op
>>      spinand_wait -> sets status = STATUS_ECC_UNCOR_ERROR
>>      nand_ecc_finish_io_req start
>>        spinand_ondie_ecc_finish_io_req run
>>          spinand_check_ecc_status start
>>            macronix_ecc_get_status -> reads status & returns -EBADMSG
>>
>> Reading data from NAND directly, I see this layout for 2K of data:
>>
>> - 4x 512 bytes data
>> - 4x 6 bytes oob = 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
>> - 4x 10 bytes ECC
>>
>> A quick ECC calc for empty blocks says it must be BCH6. So now I have
>> several options but have no idea if I'm right or which to follow.
>>
>> 1. The NAND chip seems to have ECC built in. Ignored by vendor?
> As far as I understand the expectation in Linux was that all SPI-NAND 
> chips have on-die ECC.

This was initially true, but a year ago (or so) I added support for
external engines, which allows using software and external HW engines.

>> 2. There is a hardware ECC controller -> Driver must be coded
> Yes there is an ECC controller in the RTL93xx chips but based on the 
> comment above (and some pretty useless documentation) I elected not to 
> attempt to use it.

It was not even possible to do differently at that time :)

>> 3. Maybe I must activate the software BCH driver
> Software BCH might be an alternative to using the ECC controller.

It is now indeed. I haven't played much with sw engines with SPI NANDs
but it should work, check out the bindings.
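
As a rough sanity check of the BCH6 estimate quoted above, the sketch
below derives BCH parameters from the observed layout (512-byte steps,
10 ECC bytes per step). It assumes the Galois field order m is the bit
length of (8 * step_size + 1) and that the strength t is the largest
value whose parity fits in the ECC bytes. The function names are
hypothetical and only serve this illustration.

```python
import math


def bch_params(step_size, ecc_bytes):
    """Estimate (m, t) for a BCH code over one ECC step.

    Assumption: m is the bit length of (8 * step_size + 1), i.e. the
    smallest field order whose codeword can hold the data bits, and t is
    the largest strength whose m * t parity bits fit in ecc_bytes.
    """
    m = (8 * step_size + 1).bit_length()
    t = (8 * ecc_bytes) // m
    return m, t


def parity_bytes(m, t):
    """Number of bytes needed to store m * t parity bits."""
    return math.ceil(m * t / 8)


m, t = bch_params(512, 10)
print(f"m={m}, t={t}, parity bytes={parity_bytes(m, t)}")
```

With these assumptions, 512-byte steps and 10 ECC bytes give m=13 and
t=6 (78 parity bits, rounded up to 10 bytes), which is consistent with
the BCH6 guess.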

>> 4. The old vendor firmware (Linux 4.x) uses other ECC logic.
> I think this is the crux of the problem. Realtek seem totally 
> uninterested in upstreaming support for their chips (not sure how that's 
> going to pan out with emerging requirements like RED and CRA) so it's 
> left to people like you and me. In the meantime their SDK has made 
> decisions that upstream doesn't know about and when it comes to things 
> like NAND ECC layouts this causes problems.

The ECC layout matters if you use jffs2, or if you disable the on-die
ECC engine and replace it by something else.

>> Does anyone have good ideas on what to do first from here?

Any chance the data could be scrambled? (just asking). Be careful, the
current Macronix DT property to enable scrambling sets an OTP bit. There
is a method with ->set_feature() which is volatile, but it is not yet
implemented upstream. So if your vendor fw did enable scrambling, reads
from an upstream kernel might show up as uncorrectable ECC errors.
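
For reference, the "uncorrectable" state seen in the call trace above
comes from the two ECC bits of the SPI NAND status register. The sketch
below decodes them using the bit positions of the generic STATUS_ECC_*
definitions in the kernel's spinand header. The helper name is
hypothetical, and vendor-specific meanings of the remaining value are
not covered.

```python
# Generic SPI NAND status register ECC field (bits 5:4).
STATUS_ECC_MASK = 0x30
STATUS_ECC_NO_BITFLIPS = 0 << 4
STATUS_ECC_HAS_BITFLIPS = 1 << 4
STATUS_ECC_UNCOR_ERROR = 2 << 4


def decode_ecc_status(status):
    """Return a short description of the ECC field in a status byte."""
    ecc = status & STATUS_ECC_MASK
    if ecc == STATUS_ECC_NO_BITFLIPS:
        return "no bitflips"
    if ecc == STATUS_ECC_HAS_BITFLIPS:
        return "bitflips corrected"
    if ecc == STATUS_ECC_UNCOR_ERROR:
        return "uncorrectable"
    # Value 3 is not defined generically; its meaning depends on the chip.
    return "vendor specific"


print(decode_ecc_status(0x20))
```

An on-die engine reporting "uncorrectable" for every page is what one
would expect when the stored ECC bytes were produced by a different
engine, or when the data was scrambled.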

> Probably depends. Blanking the NAND chip and reformatting it will 
> resolve the errors from an upstream point of view. That's obviously not 
> really going to be something you want to do if you expect to swap back 
> and forth between the stock firmware and an upstream kernel.
>
> You'll probably want to convince the mtd code to allow the on-die ECC to 
> be disabled and find whatever software BCH settings are needed that work 
> with the stock firmware. Then we could maybe look at using the ECC 
> controller to accelerate that.

Yes.
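
When experimenting with candidate BCH settings offline, a raw page dump
can be split according to the layout Markus described (4x 512 bytes
data, then 4x 6 bytes free OOB, then 4x 10 bytes ECC). This is a minimal
sketch under that assumption; the function name and constants are
hypothetical, and the real vendor layout may differ in ordering.

```python
STEPS = 4        # ECC steps per page
STEP_SIZE = 512  # data bytes per step
FREE_OOB = 6     # free OOB bytes per step (observed as 0xff)
ECC_BYTES = 10   # ECC bytes per step

RAW_PAGE_SIZE = STEPS * (STEP_SIZE + FREE_OOB + ECC_BYTES)  # 2112


def split_raw_page(raw):
    """Split a raw 2048+64 byte page into data, free OOB and ECC chunks."""
    if len(raw) != RAW_PAGE_SIZE:
        raise ValueError(f"expected {RAW_PAGE_SIZE} bytes, got {len(raw)}")

    def chunks(offset, size):
        # Return STEPS consecutive slices of `size` bytes from `offset`.
        return [raw[offset + i * size: offset + (i + 1) * size]
                for i in range(STEPS)]

    data_off = 0
    free_off = data_off + STEPS * STEP_SIZE
    ecc_off = free_off + STEPS * FREE_OOB
    return (chunks(data_off, STEP_SIZE),
            chunks(free_off, FREE_OOB),
            chunks(ecc_off, ECC_BYTES))


# Example with a synthetic page: zeroed data, 0xff free OOB, 0xaa ECC.
page = bytes(2048) + b"\xff" * 24 + b"\xaa" * 40
data, free, ecc = split_raw_page(page)
print(len(data[0]), len(free[0]), len(ecc[0]))
```

Each data chunk can then be paired with its ECC chunk and fed to a BCH
implementation to test a given polynomial and bit ordering.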

Good luck,
Miquèl


