NAND timeout issues with blank chip and Marvell NFC

Chris Packham Chris.Packham at alliedtelesis.co.nz
Thu Apr 26 15:43:10 PDT 2018


Hi Miquel,

On 26/04/18 19:03, Miquel Raynal wrote:
> Hi Chris,
> 
> On Thu, 26 Apr 2018 05:16:57 +0000, Chris Packham
> <Chris.Packham at alliedtelesis.co.nz> wrote:
> 
>> An update for the end of my working day.
>>
>> On 26/04/18 13:40, Chris Packham wrote:
>>> On 26/04/18 09:22, Chris Packham wrote:
>>>> Hi Miquel,
>>>>
>>>> On 25/04/18 04:08, Miquel Raynal wrote:
>>>>> Hi Steve, Chris,
>>>>>
>>>>> On Tue, 24 Apr 2018 08:49:47 -0700, Steve deRosier <derosier at gmail.com>
>>>>> wrote:
>>>>>   
>>>>>> Hi Chris,
>>>>>>
>>>>>> On Mon, Apr 23, 2018 at 10:31 PM, Chris Packham
>>>>>> <Chris.Packham at alliedtelesis.co.nz> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> We're in the process of qualifying new NAND chips (Macronix
>>>>>>> MX30LF2G18AC) for one of our Armada-385 based devices and we're
>>>>>>> experiencing some long startup times on units with factory fresh NAND
>>>>>>> chips. Anecdotally I think I've also seen this behaviour on the old
>>>>>>> chips as well (Micron MT29F2G08ABAEAWP-ITX:E).
>>>>>>>
>>>>>>> On 4.17.0-rc2 with the newly re-written NAND infrastructure we see
>>>>>>>
>>>>>>> nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
>>>>>>> nand: Macronix MX30LF2G18AC
>>>>>>> nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
>>>>>>> marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000080)
>>>>>>> marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000280)
>>>>>>> Bad block table not found for chip 0
>>>>>>> Bad block table not found for chip 0
>>>>>>> Scanning device for bad blocks
>>>>>>>
>>>>>>> (nothing for some time)
>>>>>>>
>>>>>>> On an older kernel we see
>>>>>>>
>>>>>>> pxa3xx-nand f10d0000.flash: This platform can't do DMA on this device
>>>>>>> nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
>>>>>>> nand: Macronix MX30LF2G18AC
>>>>>>> nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
>>>>>>> pxa3xx-nand f10d0000.flash: ECC strength 16, ECC step size 2048
>>>>>>> Bad block table not found for chip 0
>>>>>>> Bad block table not found for chip 0
>>>>>>> Scanning device for bad blocks
>>>>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>>>>> ...
>>>>>>> (time outs continue for some time)
>>>>>>>
>>>>>>> Presumably the new driver in 4.17.0-rc2 is experiencing the same wait
>>>>>>> time out but just not complaining about it.
>>>>>>>
>>>>>>> If we leave the system running long enough (in the order of 30 minutes)
>>>>>>> things seem to sort themselves out and bootup continues, the subsequent
>>>>>>> boots are fine. If we run 'nand erase.chip' from u-boot on a fresh unit
>>>>>>> and then boot into the kernel then things are also fine.
>>>>>>>
>>>>>>> If we run 'nand scrub.chip -y' from u-boot we are able to re-create the
>>>>>>> problem.
>>>>>>>
>>>>>>> Our suspicion is that erased state of the chip is probably not agreeable
>>>>>>> with either the ecc data or the bad block table location (or both). By
>>>>>>> erasing it from u-boot this must fill in valid data in the expected
>>>>>>> places and the kernel is happy.
>>>>>>>        
>>>>>>
>>>>>> During your very first boot, Linux can't find the bad-block table and
>>>>>> thus does a full scan of the chip, each and every block, to find the
>>>>>> manufacturer bad block marks and then constructs the table. I imagine
>>>>>> you've got a parameter incorrect somewhere that's causing it to wait
>>>>>> for timeouts at read points, instead of quickly able to read through
>>>>>> the 2k or 4k blocks on that flash.  On subsequent boots, you don't see
>>>>>> this issue because the BBT is found and Linux just uses that. Same
>>>>>> deal if you do a `nand erase.chip`, because the BBT is itself marked
>>>>>> with a bad-block marker and gets skipped during a normal erase.
>>>>>
>>>>> I share Steve's thoughts on that, there is probably some
>>>>> misconfiguration at some point, having a first long boot is not a
>>>>> problem, but 30 minutes for a 256MiB chip... What I don't understand is
>>>>> that you should have timeouts with the recent kernel too if there is
>>>>> actually something wrong happening.
>>>>
>>>> As I mentioned in my other reply I may have understated the time. It is
>>>> ~30mins with the old pxa3xx driver but the new one seems to block
>>>> indefinitely for me.
>>>>   
>>>>>>
>>>>>> Now, I don't know if you're aware of this, but by doing the `nand
>>>>>> scub.chip -y`, you've ruined the flash chip.  That device can not be
>>>>>> relied upon anymore. A scrub will ignore the factory bad-block-marks
>>>>>> and erase them. Unless you stored this information off-chip and
>>>>>> rewrite the markers, you've now lost the bad-block information from
>>>>>> the manufacturer's tests.  In any case, this erases the BBT, so your
>>>>>> next boot triggers Linux to rebuild the BBT.
>>>>>
>>>>> I think U-Boot will do it automatically after the scrub. But the result
>>>>> is still the same.
>>>>>   
>>>>>>   
>>>>>>> We could update our manufacturing procedures to run 'nand erase.chip'
>>>>>>> before the first boot but this feels wrong. Some of our devices boot
>>>>>>> over the network so the nand is not normally touched by the bootloader.
>>>>>>> It seems that there is some unhandled error condition that is stopping
>>>>>>> the kernel from seeing that the chip is completely blank and making
>>>>>>> forward progress.
>>>>>>>        
>>>>>>
>>>>>> erase chip won't fix your issue. The BBT scan is going to happen
>>>>>> anyway. There is however clearly some parameter that is setup
>>>>>> incorrectly that's causing it to wait for the timeout instead of being
>>>>>> able to quickly read pages. I don't see why that'd be unique to the
>>>>>> BBT scan however, I'd expect you to see the problem on all reads, thus
>>>>>> slowing down the system noticeably in general.
>>>>>>
>>>>>> Your hint is likely these lines:
>>>>>>         " marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000080)
>>>>>>           marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000280)"
>>>>>>
>>>>>> You can go look at that in the driver and compare with the relevant
>>>>>> behavior in the datasheets. Sorry, but I can't help more specifically,
>>>>>> I'd have to know your particular hardware and datasheets and spend
>>>>>> some time looking at the code.
>>>>>
>>>>> I also reproduce the problem on my Armada 38x, the two timeouts at boot
>>>>> time (not specifically the first one) are suspicious, I'm going to look
>>>>> into it.
>>>>
>>>> Thanks for leaping onto it. I'll keep investigating it here as well.
>>>>   
>>>
>>> When I add some debugging to marvell_nfc_wait_op I see
>>>
>>> marvell-nfc f10d0000.flash: timeout_ms = 250
>>> marvell-nfc f10d0000.flash: done
>>> marvell-nfc f10d0000.flash: timeout_ms = 1
>>> marvell-nfc f10d0000.flash: done
>>> nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
>>> nand: Macronix MX30LF2G18AC
>>> nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
>>> Bad block table not found for chip 0
>>> Bad block table not found for chip 0
>>> Scanning device for bad blocks
>>> marvell-nfc f10d0000.flash: timeout_ms = 4
>>> marvell-nfc f10d0000.flash: done
>>> marvell-nfc f10d0000.flash: timeout_ms = 600000000
>>>
>>> That last line looks quite odd. I think the problem might be related to
>>> this line from marvell_nfc_hw_ecc_bch_write_page()
>>>
>>>     ret = marvell_nfc_wait_op(chip,
>>>                               chip->data_interface.timings.sdr.tPROG_max);
>>>
>>> Based on the datasheet that number is 600 microseconds(us) not the
>>> milliseconds expected by marvell_nfc_wait_op().
>>>    
>>
>> So naturally throwing in some PSEC_TO_MSEC() calls stopped the really
>> long timeouts but then the probe would fail. It seems that I'm getting
>> some "page done" and "command done" interrupts indications (NDSR =
>> 0x0000500) while attempting to write the oob data.
> 
> My bad, I might have forgotten one of these. Can you send a patch or
> show me which delay was wrong?

Here's the local change I have applied. Assuming my MUA doesn't mess up 
the formatting. I'm not 100% sure this is correct. The older pxa driver 
seemed to have a fixed 200ms delay for these operations.

--- 8< ---
Subject: [PATCH] mtd: rawnand: marvell: pass ms delay to wait_op

marvell_nfc_wait_op() expects the delay to be expressed in milliseconds
but nand_sdr_timings uses picoseconds. Use PSEC_TO_MSEC when passing
tPROG_max to marvell_nfc_wait_op().

Signed-off-by: Chris Packham <chris.packham at alliedtelesis.co.nz>
---
  drivers/mtd/nand/raw/marvell_nand.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/mtd/nand/raw/marvell_nand.c 
b/drivers/mtd/nand/raw/marvell_nand.c
index 1d779a35ac8e..e4b964fd40d8 100644
--- a/drivers/mtd/nand/raw/marvell_nand.c
+++ b/drivers/mtd/nand/raw/marvell_nand.c
@@ -1074,7 +1074,7 @@ static int 
marvell_nfc_hw_ecc_hmg_do_write_page(struct nand_chip *chip,
  		return ret;

  	ret = marvell_nfc_wait_op(chip,
-				  chip->data_interface.timings.sdr.tPROG_max);
+				  PSEC_TO_MSEC(chip->data_interface.timings.sdr.tPROG_max));
  	return ret;
  }

@@ -1494,7 +1494,7 @@ static int 
marvell_nfc_hw_ecc_bch_write_page(struct mtd_info *mtd,
  	}

  	ret = marvell_nfc_wait_op(chip,
-				  chip->data_interface.timings.sdr.tPROG_max);
+				  PSEC_TO_MSEC(chip->data_interface.timings.sdr.tPROG_max));

  	marvell_nfc_disable_hw_ecc(chip);

--- 8< ---

> 
> Can you also add a dump_stack() in the error path of the timeout
> (probably *wait_cmdd()) and show the full boot log?
> 

It's actually *wait_op(). Here's the output with a small debug patch 
applied on top of the delay changes above.

diff --git a/drivers/mtd/nand/raw/marvell_nand.c 
b/drivers/mtd/nand/raw/marvell_nand.c
index e4b964fd40d8..5af28c7f4487 100644
--- a/drivers/mtd/nand/raw/marvell_nand.c
+++ b/drivers/mtd/nand/raw/marvell_nand.c
@@ -627,6 +627,8 @@ static int marvell_nfc_wait_op(struct nand_chip 
*chip, unsigned int timeout_ms)
         marvell_nfc_disable_int(nfc, NDCR_RDYM);
         marvell_nfc_clear_int(nfc, NDSR_RDY(0) | NDSR_RDY(1));
         if (!ret) {
+               dev_info(nfc->dev, "NDSR %08x\n", readl(nfc->regs + NDSR));
+               dump_stack();
                 dev_err(nfc->dev, "Timeout waiting for RB signal\n");
                 return -ETIMEDOUT;
         }
marvell-nfc f10d0000.flash: NDSR 00000500
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.17.0-rc2-at1+ #3
Hardware name: Marvell Armada 380/385 (Device Tree)
[<80110f40>] (unwind_backtrace) from [<8010c350>] (show_stack+0x10/0x14)
[<8010c350>] (show_stack) from [<805c1274>] (dump_stack+0x88/0x9c)
[<805c1274>] (dump_stack) from [<803fdefc>] (marvell_nfc_wait_op+0xb0/0xc8)
[<803fdefc>] (marvell_nfc_wait_op) from [<803fe4e0>] 
(marvell_nfc_hw_ecc_bch_write_page+0x264/0x2d8)
[<803fe4e0>] (marvell_nfc_hw_ecc_bch_write_page) from [<803f6c64>] 
(nand_do_write_ops+0x328/0x438)
[<803f6c64>] (nand_do_write_ops) from [<803f6dc8>] 
(nand_write_oob+0x54/0x84)
[<803f6dc8>] (nand_write_oob) from [<803fa7a8>] (write_bbt+0x31c/0x720)
[<803fa7a8>] (write_bbt) from [<803fb364>] (nand_default_bbt+0x314/0x6fc)
[<803fb364>] (nand_default_bbt) from [<803f5a84>] 
(nand_scan_tail+0xa98/0xaf0)
[<803f5a84>] (nand_scan_tail) from [<803fed68>] 
(marvell_nand_chip_init+0x6b8/0x8ec)
[<803fed68>] (marvell_nand_chip_init) from [<803ff2dc>] 
(marvell_nfc_probe+0x340/0x38c)
[<803ff2dc>] (marvell_nfc_probe) from [<803bfdc8>] 
(platform_drv_probe+0x34/0x70)
[<803bfdc8>] (platform_drv_probe) from [<803be724>] 
(really_probe+0x230/0x2c8)
[<803be724>] (really_probe) from [<803be868>] (__driver_attach+0xac/0xbc)
[<803be868>] (__driver_attach) from [<803bcaa4>] 
(bus_for_each_dev+0x68/0xb4)
[<803bcaa4>] (bus_for_each_dev) from [<803bdcec>] 
(bus_add_driver+0x198/0x210)
[<803bdcec>] (bus_add_driver) from [<803bef78>] (driver_register+0x78/0xf8)
[<803bef78>] (driver_register) from [<80102ca4>] 
(do_one_initcall+0x50/0x19c)
[<80102ca4>] (do_one_initcall) from [<80800e3c>] 
(kernel_init_freeable+0x144/0x1e8)
[<80800e3c>] (kernel_init_freeable) from [<805d5300>] 
(kernel_init+0x8/0x110)
[<805d5300>] (kernel_init) from [<801010e8>] (ret_from_fork+0x14/0x2c)
Exception stack(0xbc037fb0 to 0xbc037ff8)
7fa0:                                     00000000 00000000 00000000 
00000000
7fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
00000000
7fe0: 00000000 00000000 00000000 00000000 00000013 00000000
marvell-nfc f10d0000.flash: Timeout waiting for RB signal
nand_bbt: error while writing BBT block -110

>>
>> I've also re-done some of my initial tests and it seems that 4.17-rc2
>> cannot mount this chip. The 4.16.4 kernel can.
>>
>> Even if I use the old kernel to create the ubi volumes the new kernel
>> seems to hang while mounting in a similar place to what I was seeing
>> with the BBT creation.
> 
> Thanks for your time,
> Miquèl
> 




More information about the linux-mtd mailing list