NAND ECC errors
Chris Packham
Chris.Packham at alliedtelesis.co.nz
Thu Aug 7 13:36:56 PDT 2025
Hi Markus,
On 08/08/2025 03:16, markus.stockhausen at gmx.de wrote:
> Hi,
>
> Chris (CC) developed the drivers/spi/spi-realtek-rtl-snand.c for the
> Realtek switch platform. Thanks for that and the inclusion into mainline.
> While adding it to one of my devices I'm getting ECC errors.
>
> Situation is as follows.
>
> - Linksys LGS328 (with RTL9301 SOC and that NAND controller)
> - OpenWrt with Kernel 6.12 longterm
> - The Realtek SPI NAND driver (backported from current master)
> - Macronix MX35LF1GE4AB (1GBit)
> - Boot via TFTP
>
> I found a vendor UBI partition in NAND that I want to analyze.
> It is active and the vendor firmware seems to work on it.
> I assume it contains a filesystem with configuration and logs.
> During ubiattach I get tons of errors "ubi0 warning: ubi_io_read:
> Error -77 (ECC error) while reading 64 bytes from PEB 0:0, read
> only 64 bytes, retry".
>
> Call stack shows:
>
> spinand_mtd_regular_page_read
> spinand_read_page
> spinand_load_page_op
> spinand_wait -> sets status = STATUS_ECC_UNCOR_ERROR
> nand_ecc_finish_io_req start
> spinand_ondie_ecc_finish_io_req run
> spinand_check_ecc_status start
> macronix_ecc_get_status -> reads status & returns -EBADMSG
>
> Reading data from NAND directly I see this data layout for 2K data
>
> - 4x 512 bytes data
> - 4x 6 bytes oob = 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
> - 4x 10 bytes ECC
>
> A quick ECC calc for empty blocks says it must be BCH6. So now I have
> several options but have no idea if I'm right or which to follow.
>
> 1. The NAND chip seems to have ECC built in. Ignored by vendor?
As far as I understand the expectation in Linux was that all SPI-NAND
chips have on-die ECC.
> 2. There is a hardware ECC controller -> Driver must be coded
Yes there is an ECC controller in the RTL93xx chips but based on the
comment above (and some pretty useless documentation) I elected not to
attempt to use it.
> 3. Maybe I must activate the software BCH driver
Software BCH might be an alternative to using the ECC controller.
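The BCH6 guess can be cross-checked with a bit of arithmetic. The sketch
below assumes the usual binary BCH relationship (a Galois field order m
just large enough for the step size, and m parity bits per correctable
bit), which as far as I can tell is also how the kernel's software BCH
code derives its parameters. The function name bch_params is made up
for illustration.

```python
def bch_params(step_size, ecc_bytes):
    """Derive BCH parameters from an ECC step size and ECC byte count.

    Assumption: m is the smallest field order whose codeword length
    (2**m - 1 bits) can hold the data bits plus parity, and each
    correctable bit costs m parity bits.
    """
    # Position of the highest set bit of (1 + data bits), like fls() in C.
    m = (1 + 8 * step_size).bit_length()
    # Number of correctable bits that fit into the available ECC bytes.
    t = (ecc_bytes * 8) // m
    # Parity bytes actually needed for that strength, rounded up.
    parity_bytes = (m * t + 7) // 8
    return m, t, parity_bytes


# 512-byte steps with 10 ECC bytes each, as seen in the raw dump.
m, t, parity_bytes = bch_params(512, 10)
```

For 512-byte steps and 10 ECC bytes this gives m = 13, t = 6 and 10
parity bytes (78 bits rounded up), which matches the layout in the dump.
That only shows the sizes are consistent with BCH6; the polynomial and
byte ordering used by the stock firmware would still need to be found.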
> 4. The old vendor firmware (Linux 4.x) uses other ECC logic.
I think this is the crux of the problem. Realtek seem totally
uninterested in upstreaming support for their chips (not sure how that's
going to pan out with emerging requirements like RED and CRA) so it's
left to people like you and me. In the meantime their SDK has made
decisions that upstream doesn't know about and when it comes to things
like NAND ECC layouts this causes problems.
> Anyone good ideas what to do first from here?
Probably depends. Blanking the NAND chip and reformatting it will
resolve the errors from an upstream point of view. That's obviously not
really going to be something you want to do if you expect to swap back
and forth between the stock firmware and an upstream kernel.
You'll probably want to convince the mtd code to allow the on-die ECC to
be disabled and find whatever software BCH settings are needed that work
with the stock firmware. Then we could maybe look at using the ECC
controller to accelerate that.
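For experimenting with candidate BCH settings offline, a raw page dump
can be split along the layout described above. This is only a sketch: it
assumes the three groups (4x 512 data, 4x 6 OOB, 4x 10 ECC) are stored
back to back in that order and that the dump was taken with the on-die
ECC out of the way. The names split_raw_page and is_erased are
hypothetical.

```python
SECTORS = 4      # ECC steps per 2K page
DATA_LEN = 512   # data bytes per step
OOB_LEN = 6      # free OOB bytes per step
ECC_LEN = 10     # ECC bytes per step
RAW_PAGE_LEN = SECTORS * (DATA_LEN + OOB_LEN + ECC_LEN)  # 2048 + 24 + 40


def split_raw_page(raw):
    """Split one raw page into per-step data, OOB and ECC chunks.

    Assumption: all data chunks come first, then all OOB chunks, then
    all ECC chunks, each group in step order.
    """
    if len(raw) != RAW_PAGE_LEN:
        raise ValueError("expected %d bytes, got %d" % (RAW_PAGE_LEN, len(raw)))
    oob_start = SECTORS * DATA_LEN
    ecc_start = oob_start + SECTORS * OOB_LEN
    data = [raw[i * DATA_LEN:(i + 1) * DATA_LEN] for i in range(SECTORS)]
    oob = [raw[oob_start + i * OOB_LEN:oob_start + (i + 1) * OOB_LEN]
           for i in range(SECTORS)]
    ecc = [raw[ecc_start + i * ECC_LEN:ecc_start + (i + 1) * ECC_LEN]
           for i in range(SECTORS)]
    return data, oob, ecc


def is_erased(chunk):
    """True if every byte of the chunk is 0xff."""
    return all(b == 0xFF for b in chunk)
```

Each (data, oob, ecc) triple can then be fed to whatever BCH
implementation is being tried, and is_erased helps to pick out steps
whose OOB bytes are all 0xff like the ones in the dump.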
> Thanks in advance.
>
> Markus
>