Question about Davinci NAND using 4 bit ECC

Tue Jul 18 01:02:37 PDT 2023

Hi Christopher,

christophercordahi at nanometrics.ca wrote on Mon, 17 Jul 2023 21:46:34
-0400:

> Hello Linux TI NAND experts,
> 
> An ex-colleague discovered a discrepancy between the nand_davinci_correct_4bit()
> function [1] and the sequence of steps recommended by TI.  I'm using a TI OMAP
> L138 processor, based on the DA-850 EVM board.  Ideally the code should match
> the sequence of steps in the OMAP-L138 Technical Reference Manual section
> 20.2.5.6 NAND Flash Mode [2].  This very old driver was initially written for
> the DM355 processor on a DM-355 EVM board.  The documentation in the older
> TMS320DM35x Digital Media System-on-Chip (DMSoC) Asynchronous External Memory
> Interface (EMIF) [3] is similar to the more recent documentation for the OMAP
> L138, but not exactly the same.  I suspect both processors have the same EMIF
> peripheral, and the L138 procedure is simply more up-to-date, but I can't be
> sure.  And the code doesn't match either one.
> 
> The first difference is that the code contains the following
> 
> /*
> * Clear any previous address calculation by doing a dummy read of an
> * error address register.
> */
> davinci_nand_readl(info, NAND_ERR_ADD1_OFFSET);
> 
> /* Start address calculation, and wait for it to complete.
> * We _could_ start reading more data while this is working,
> * to speed up the overall page read.
> */
> davinci_nand_writel(info, NANDFCR_OFFSET,
>         davinci_nand_readl(info, NANDFCR_OFFSET) | BIT(13));
> 
> The dummy read is not mentioned in the DM-355 documentation.
> The f12a9473283e68ae708e9ada37cb352ea2652397 commit [4] in which it was added
> doesn't provide any details.  This dummy read is mentioned in step 10 of the
> newer L138 documentation, but it should be after the "Start address calculation"
> write in step 9 and it specifically indicates to not use that error address
> register.
> 
> > 10. Perform a dummy read to any EMIFA registers except the NAND Flash error
> > address 1-2 registers (NANDERRADD[2:1]) or the NAND Flash error value 1-2
> > registers (NANDERRVAL[2:1]).  
> 
> The code then continues with an extra block introduced in commit
> 1c3275b656045aff9a75bb2c9f3251af1043ebb3 [5] to work-around unexpected behaviour
> before continuing with what seems to be the correct procedure.
> 
> /*
> * ECC_STATE field reads 0x3 (Error correction complete) immediately
> * after setting the 4BITECC_ADD_CALC_START bit. So if you immediately
> * begin trying to poll for the state, you may fall right out of your
> * loop without any of the correction calculations having taken place.
> * The recommendation from the hardware team is to initially delay as
> * long as ECC_STATE reads less than 4. After that, ECC HW has entered
> * correction state.
> */
> timeo = jiffies + usecs_to_jiffies(100);
> do {
>     ecc_state = (davinci_nand_readl(info,
>             NANDFSR_OFFSET) >> 8) & 0x0f;
>     cpu_relax();
> } while ((ecc_state < 4) && time_before(jiffies, timeo));
> 
> Can anyone rationalize these discrepancies?
> Do these two discrepancies offset each other resulting in correct behaviour?
> If this procedure is in fact incorrect, what would the symptoms be?
> Does this mean that anything read from NAND Flash could be corrupted?
> 
> Any clarification would be greatly appreciated.

It is hard to tell whether these changes are legitimate or not without
properly testing. If you benchmark your changes and never see any
error, it does not mean you fall into the pits during your tests. If
however, by removing one of the two hacks, you observe a wrong
behavior, then you could perhaps tell that the latter offsets the
former. But TBH I don't think they are related, on one side we want to
reset the outcome of the former calculations, on the other side we need
to wait a bit before a status register can be trusted.

Observing bitflips in userspace and detecting them can be achieved
easily with the nandbiterrs -i tool from the mtd-utils test suite. This
tool manually introduces bitflips and expects them to be corrected
(until proper failure) by the ECC engine. Be aware, it will smash your
data.

> [1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/mtd/nand/raw/davinci_nand.c
> [2] https://www.ti.com/lit/ug/spruh77c/spruh77c.pdf#page=884
> [3] https://www.ti.com/lit/ug/sprued1b/sprued1b.pdf?ts#page=29
> [4] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/mtd/nand/davinci_nand.c?id=f12a9473283e68ae708e9ada37cb352ea2652397
> [5] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/mtd/nand/davinci_nand.c?id=1c3275b656045aff9a75bb2c9f3251af1043ebb3
> 

Good luck,
Miquèl