nand: WARNING: a0000000.nand: the ECC used on your system (1b/256B) is too weak compared to the one required by the NAND chip (4b/512B)

Wed Jun 23 06:16:40 PDT 2021

Hi Christophe,

Christophe Leroy <christophe.leroy at csgroup.eu> wrote on Wed, 23 Jun
2021 11:41:46 +0200:

> On 19/06/2021 at 20:40, Miquel Raynal wrote:
> > Hi Christophe,
> >   
> >>>> Now and then I'm using one of the latest kernels (Today is 5.13-rc6), and sometime in one of the 5.x releases, I started to get errors like:
> >>>>
> >>>> [    5.098265] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [    5.103859] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 60 bytes from PEB 99:59824, read only 60 bytes, retry
> >>>> [    5.525843] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [    5.531571] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [    5.537490] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 3073 bytes from PEB 107:108976, read only 3073 bytes, retry
> >>>> [    5.691121] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [    5.696709] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [    5.702426] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [    5.708141] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [    5.714103] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 3035 bytes from PEB 107:25144, read only 3035 bytes, retry
> >>>> [   20.523689] random: crng init done
> >>>> [   21.892130] ecc_sw_hamming_correct: uncorrectable ECC error
> >>>> [   21.897730] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 1394 bytes from PEB 116:75776, read only 1394 bytes, retry
> >>>>
> >>>> Most of the time, when the reading of the file fails, I just have to read it once more and it gets read without that error.  
> >>>
> >>> It really looks like a regular bitflip happening "sometimes". Is this a
> >>> board which already had a life? What are the usage counters (UBI should
> >>> tell you this) compared to the official endurance of your chip (see the
> >>> datasheet)?  
> >>
> >> The board had a peaceful life:
> >>
> >> UBI reports "ubi0: max/mean erase counter: 49/20, WL threshold: 4096"  
> > 
> > Mmmh. Indeed.
> >   
> >>
> >> I have tried with half a dozen boards and all of them have the issue.
> >>  
> >>>> What am I supposed to do to avoid the ECC weakness warning at startup and to fix that ECC error issue?
> >>>
> >>> I honestly don't think the errors come from the 5.1x kernels given the
> >>> above logs. If you flash back your old 4.14 I am pretty sure you'll
> >>> have the same errors at some point.  
> >>
> >> I don't have any problem like that with 4.14 on any of the boards.
> >>
> >> When booting a 4.14 kernel I don't get any problem on the same board.
> >>  
> > 
> > If you can reliably show that when returning to a 4.14 kernel the ECC
> > weakness disappears, then there is certainly something new. What driver
> > are you using? Maybe you can do a bisection?  
> 
> Using the GPIO driver, and the NAND chip is a HYNIX.
> 
> I can say that the ECC weakness is absent up to and including v5.5. It appears with v5.6.
> 
> I have tried bisecting between those two versions but couldn't arrive at a reliable result. The closer to v5.5 you go, the more difficult it is to reproduce the issue.
> 
> So I looked at what changed between those versions, and in fact it is mainly optimisation in the powerpc code. It seems that the more powerpc is optimised, the more often the problem occurs.
> 
> Looking at the GPIO NAND driver, I saw that the gpio_nand_dosync() function is a no-op on this platform. After adding a memory barrier in that function, the ECC weakness disappeared completely.

I see that the 'fix' in gpio_nand_dosync() has only been designed for
ARM platforms, perhaps it would make sense to have a PPC variant here?
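
For illustration only, here is a minimal userspace sketch of the idea: a
variant of gpio_nand_dosync() that issues a full memory barrier instead
of doing nothing. Every name with a _sketch suffix is hypothetical,
__sync_synchronize() (a GCC/Clang builtin) stands in for the kernel's
mb(), and the exact barrier Christophe added is not shown in this
thread, so the choice of a full barrier is an assumption.

```c
#include <stdint.h>

/* Hypothetical stand-in for the driver's private state (not the real struct). */
struct gpiomtd_sketch {
	volatile uint8_t *io;	/* mock of the memory-mapped NAND data window */
	volatile int cle;	/* mock of the CLE GPIO line */
};

/*
 * Userspace stand-in for the kernel's mb(): a full barrier, so accesses
 * issued before it are ordered before accesses issued after it.
 */
static inline void mb_sketch(void)
{
	__sync_synchronize();
}

/* Assumed non-ARM variant: a barrier instead of an empty function body. */
static inline void gpio_nand_dosync_sketch(struct gpiomtd_sketch *g)
{
	(void)g;	/* unused here, kept to mirror the driver's signature */
	mb_sketch();
}

/*
 * Assumed call pattern: synchronise between each GPIO change and the
 * access to the NAND window, so the two cannot be reordered.
 */
static void write_cmd_sketch(struct gpiomtd_sketch *g, uint8_t cmd)
{
	gpio_nand_dosync_sketch(g);
	g->cle = 1;		/* raise CLE: next byte is a command */
	gpio_nand_dosync_sketch(g);
	*g->io = cmd;		/* write the command byte to the window */
	gpio_nand_dosync_sketch(g);
	g->cle = 0;		/* drop CLE again */
}
```

In the real driver the GPIO lines would be driven through gpiolib and the
window accessed with the I/O accessors; the sketch only shows where the
barrier sits relative to those accesses.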

> Not sure what the final solution has to be.

Perhaps the PowerPC maintainers can shed some light on these findings?

Thanks,
Miquèl