NAND timeout issues with blank chip and Marvell NFC

Wed Apr 25 22:16:57 PDT 2018

An update for the end of my working day.

On 26/04/18 13:40, Chris Packham wrote:
> On 26/04/18 09:22, Chris Packham wrote:
>> Hi Miquel,
>>
>> On 25/04/18 04:08, Miquel Raynal wrote:
>>> Hi Steve, Chris,
>>>
>>> On Tue, 24 Apr 2018 08:49:47 -0700, Steve deRosier <derosier at gmail.com>
>>> wrote:
>>>
>>>> Hi Chris,
>>>>
>>>> On Mon, Apr 23, 2018 at 10:31 PM, Chris Packham
>>>> <Chris.Packham at alliedtelesis.co.nz> wrote:
>>>>> Hi,
>>>>>
>>>>> We're in the process of qualifying new NAND chips (Macronix
>>>>> MX30LF2G18AC) for one of our Armada-385 based devices and we're
>>>>> experiencing some long startup times on units with factory fresh NAND
>>>>> chips. Anecdotally I think I've also seen this behaviour on the old
>>>>> chips as well (Micron MT29F2G08ABAEAWP-ITX:E).
>>>>>
>>>>> On 4.17.0-rc2 with the newly re-written NAND infrastructure we see
>>>>>
>>>>> nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
>>>>> nand: Macronix MX30LF2G18AC
>>>>> nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
>>>>> marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000080)
>>>>> marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000280)
>>>>> Bad block table not found for chip 0
>>>>> Bad block table not found for chip 0
>>>>> Scanning device for bad blocks
>>>>>
>>>>> (nothing for some time)
>>>>>
>>>>> On an older kernel we see
>>>>>
>>>>> pxa3xx-nand f10d0000.flash: This platform can't do DMA on this device
>>>>> nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
>>>>> nand: Macronix MX30LF2G18AC
>>>>> nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
>>>>> pxa3xx-nand f10d0000.flash: ECC strength 16, ECC step size 2048
>>>>> Bad block table not found for chip 0
>>>>> Bad block table not found for chip 0
>>>>> Scanning device for bad blocks
>>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>>> pxa3xx-nand f10d0000.flash: Wait time out!!!
>>>>> ...
>>>>> (time outs continue for some time)
>>>>>
>>>>> Presumably the new driver in 4.17.0-rc2 is experiencing the same wait
>>>>> time out but just not complaining about it.
>>>>>
>>>>> If we leave the system running long enough (in the order of 30 minutes)
>>>>> things seem to sort themselves out and bootup continues, the subsequent
>>>>> boots are fine. If we run 'nand erase.chip' from u-boot on a fresh unit
>>>>> and then boot into the kernel then things are also fine.
>>>>>
>>>>> If we run 'nand scrub.chip -y' from u-boot we are able to re-create the
>>>>> problem.
>>>>>
>>>>> Our suspicion is that erased state of the chip is probably not agreeable
>>>>> with either the ecc data or the bad block table location (or both). By
>>>>> erasing it from u-boot this must fill in valid data in the expected
>>>>> places and the kernel is happy.
>>>>>     
>>>>
>>>> During your very first boot, Linux can't find the bad-block table and
>>>> thus does a full scan of the chip, each and every block, to find the
>>>> manufacturer bad block marks and then constructs the table. I imagine
>>>> you've got a parameter incorrect somewhere that's causing it to wait
>>>> for timeouts at read points, instead of quickly able to read through
>>>> the 2k or 4k blocks on that flash.  On subsequent boots, you don't see
>>>> this issue because the BBT is found and Linux just uses that. Same
>>>> deal if you do a `nand erase.chip`, because the BBT is itself marked
>>>> with a bad-block marker and gets skipped during a normal erase.
>>>
>>> I share Steve's thoughts on that, there is probably some
>>> misconfiguration at some point, having a first long boot is not a
>>> problem, but 30 minutes for a 256MiB chip... What I don't understand is
>>> that you should have timeouts with the recent kernel too if there is
>>> actually something wrong happening.
>>
>> As I mentioned in my other reply I may have understated the time. It is
>> ~30mins with the old pxa3xx driver but the new one seems to block
>> indefinitely for me.
>>
>>>>
>>>> Now, I don't know if you're aware of this, but by doing the `nand
>>>> scub.chip -y`, you've ruined the flash chip.  That device can not be
>>>> relied upon anymore. A scrub will ignore the factory bad-block-marks
>>>> and erase them. Unless you stored this information off-chip and
>>>> rewrite the markers, you've now lost the bad-block information from
>>>> the manufacturer's tests.  In any case, this erases the BBT, so your
>>>> next boot triggers Linux to rebuild the BBT.
>>>
>>> I think U-Boot will do it automatically after the scrub. But the result
>>> is still the same.
>>>
>>>>
>>>>> We could update our manufacturing procedures to run 'nand erase.chip'
>>>>> before the first boot but this feels wrong. Some of our devices boot
>>>>> over the network so the nand is not normally touched by the bootloader.
>>>>> It seems that there is some unhandled error condition that is stopping
>>>>> the kernel from seeing that the chip is completely blank and making
>>>>> forward progress.
>>>>>     
>>>>
>>>> erase chip won't fix your issue. The BBT scan is going to happen
>>>> anyway. There is however clearly some parameter that is setup
>>>> incorrectly that's causing it to wait for the timeout instead of being
>>>> able to quickly read pages. I don't see why that'd be unique to the
>>>> BBT scan however, I'd expect you to see the problem on all reads, thus
>>>> slowing down the system noticeably in general.
>>>>
>>>> Your hint is likely these lines:
>>>>        " marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000080)
>>>>          marvell-nfc f10d0000.flash: Timeout on CMDD (NDSR: 0x00000280)"
>>>>
>>>> You can go look at that in the driver and compare with the relevant
>>>> behavior in the datasheets. Sorry, but I can't help more specifically,
>>>> I'd have to know your particular hardware and datasheets and spend
>>>> some time looking at the code.
>>>
>>> I also reproduce the problem on my Armada 38x, the two timeouts at boot
>>> time (not specifically the first one) are suspicious, I'm going to look
>>> into it.
>>
>> Thanks for leaping onto it. I'll keep investigating it here as well.
>>
> 
> When I add some debugging to marvell_nfc_wait_op I see
> 
> marvell-nfc f10d0000.flash: timeout_ms = 250
> marvell-nfc f10d0000.flash: done
> marvell-nfc f10d0000.flash: timeout_ms = 1
> marvell-nfc f10d0000.flash: done
> nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xda
> nand: Macronix MX30LF2G18AC
> nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
> Bad block table not found for chip 0
> Bad block table not found for chip 0
> Scanning device for bad blocks
> marvell-nfc f10d0000.flash: timeout_ms = 4
> marvell-nfc f10d0000.flash: done
> marvell-nfc f10d0000.flash: timeout_ms = 600000000
> 
> That last line looks quite odd. I think the problem might be related to
> this line from marvell_nfc_hw_ecc_bch_write_page()
> 
>    ret = marvell_nfc_wait_op(chip,
>                              chip->data_interface.timings.sdr.tPROG_max);
> 
> Based on the datasheet that number is 600 microseconds(us) not the
> milliseconds expected by marvell_nfc_wait_op().
> 

So naturally throwing in some PSEC_TO_MSEC() calls stopped the really 
long timeouts but then the probe would fail. It seems that I'm getting 
some "page done" and "command done" interrupts indications (NDSR = 
0x0000500) while attempting to write the oob data.

I've also re-done some of my initial tests and it seems that 4.17-rc2 
cannot mount this chip. The 4.16.4 kernel can.

Even if I use the old kernel to create the ubi volumes the new kernel 
seems to hang while mounting in a similar place to what I was seeing 
with the BBT creation.