[BUG] pxa3xx: wait time out when scanning for bb

Tue Dec 12 08:22:36 PST 2017

On 12 December 2017 at 05:15, Boris Brezillon
<boris.brezillon at free-electrons.com> wrote:
> On Tue, 12 Dec 2017 15:30:03 +0800
> Greg Cook <greg at morpheus.ws> wrote:
>
>> On 12 December 2017 at 15:09, Ezequiel Garcia
>> <ezequiel at vanguardiasur.com.ar> wrote:
>> > On 12 December 2017 at 03:01, Greg Cook <greg at morpheus.ws> wrote:
>> >> On 12 December 2017 at 05:16, Boris Brezillon
>> >> <boris.brezillon at free-electrons.com> wrote:
>> >>> On Mon, 11 Dec 2017 13:24:56 -0300
>> >>> Ezequiel Garcia <ezequiel at vanguardiasur.com.ar> wrote:
>> >>>
>> >>>> Greg,
>> >>>>
>> >>>> On 11 December 2017 at 10:18, Greg Cook <greg at morpheus.ws> wrote:
>> >>>> > Sean,
>> >>>> >
>> >>>> > I am not completely up-to-date on this, but everything in your traces reads
>> >>>> > like the same issue I was having on bringup for Armada 385 nand (under 4.9).
>> >>>> > I've been stuck on another project, so I haven't had time to follow up
>> >>>> > further, but I just diffed against linux-stable v4.12 pxa3xx_nand.c and it
>> >>>> > looks like the problem is still there.
>> >>>> >
>> >>>> > As far as I can see, the driver is broken for OOB reads when BCH is enabled
>> >>>> > because the setup in prepare_set_command() results in drain_fifo() not
>> >>>> > reading enough words from the read fifo in the nfc2 IP block.
>> >>>
>> >>> May I ask how you you get to this conclusion? What makes you think
>> >>> there's still unread data in the FIFO?
>> >>
>> >> Because I added some debug lines to count the number of bytes read and
>> >> I could  see handle_data_pio() was reading 2048 bytes, followed by 32
>> >> bytes of spare. If you have Marvell MV-S109094-00 REV C handy, you can
>> >> see in Table 36 that this is not correct. The correct number of spare
>> >> bytes for the NFCv2 IP block is
>> >> - 64 bytes when SPARE_EN==1 and ECC_EN==0
>> >> - 32 bytes when SPARE_EN==1 and ECC_EN==1
>> >>
>> >>>> >
>> >>>>
>> >>>> Yes, this sounds just like the bug I was expecting for OOB reads.
>> >>>>
>> >>>> > The patch we are using is below. I have the following in my DTS.
>> >>>> > nand-keep-config is commented out because I was having some issues with
>> >>>> > u-boot at the time and it may no longer be relevant:
>> >>>>
>> >>>> Probably.
>> >>>>
>> >>>> > flash at d0000 {
>> >>>> > status = "okay";
>> >>>> > num-cs = <1>;
>> >>>> > //marvell,nand-keep-config;
>> >>>> > marvell,nand-enable-arbiter;
>> >>>> > nand-on-flash-bbt;
>> >>>> > nand-ecc-strength = <4>;
>> >>>> > nand-ecc-step-size = <512>;
>> >>>> > };
>> >>>> >
>> >>>> > --- /home/user/build/linux-stable/drivers/mtd/nand/pxa3xx_nand.c
>> >>>> > +++
>> >>>> > /home/user/build/beam/openwrt/build_dir/target-arm_cortex-a9+vfpv3_musl_eabi/linux-mvebu/linux-4.9.34/drivers/mtd/nand/pxa3xx_nand.c
>> >>>> > @@ -668,7 +669,7 @@
>> >>>> >
>> >>>> >  static void drain_fifo(struct pxa3xx_nand_info *info, void *data, int len)
>> >>>> >  {
>> >>>> > - if (info->ecc_bch) {
>> >>>> > + if (info->use_ecc && info->ecc_bch) {
>> >>>
>> >>> This one might explain timeouts occurring when you drain the FIFO for
>> >>> an operation that does not enable the ECC engine (like
>> >>> READ_PARAM_PAGE).
>> >>>
>> >>> So this fix is indeed valid, but I'm almost sure it won't fix Sean's
>> >>> problem.
>> >>
>> >> From memory, I don't *think* specific line wasn't causing me any
>> >> issues, but I noticed it was not correct because of the way use_ecc
>> >> and ecc_bch flags are initialised and then used. As you say, without
>> >> this patch, you can get timeouts on non-ECC operations when you have
>> >> BCH ECC enabled generally.
>> >>
>> >>>
>> >>>> >   u32 val;
>> >>>> >   int ret;
>> >>>> >
>> >>>> > @@ -1012,7 +1014,11 @@
>> >>>> >
>> >>>> >   if (info->cur_chunk < info->nfullchunks) {
>> >>>> >   info->step_chunk_size = info->chunk_size;
>> >>>> > - info->step_spare_size = info->spare_size;
>> >>>> > + if (info->use_ecc) {
>> >>>> > + info->step_spare_size = info->spare_size;
>> >>>> > + } else {
>> >>>> > + info->step_spare_size = info->spare_size + info->ecc_size;
>> >>>> > + }
>> >>>
>> >>> The only case this change would fix is when you try to read/write pages
>> >>> in raw mode, and I'm pretty sure this driver does not support raw
>> >>> accesses.
>> >>
>> >> Why do you say it would only affect raw mode? This code is not
>> >> specific to raw mode, nor
>> >> is the related code in handle_data_pio(), which uses the value of
>> >> step_spare_size to drain
>> >> the hardware FIFO.
>> >>
>> >
>> > AFAICS, this patch affects READOOB commands, which are
>> > issued when there is no BBT to detect bad blocks.
>>
>> Agree that this only really affects READOOB, because READ0 will always
>> use ECC. I guess this is why it's only catching people at bringup
>> because we are booting into Linux with a blank device.
>

Right, READ0 use ECC. But also, OOB writes are not fixed by this
patch. Proper OOB support is something this driver doesn't support
well.

> I overlooked the READOOB case. What I don't understand though is why we
> are disabling ECC in this case. I mean, free OOB bytes seem to be
> ECC protected (at least for BCH), so ECC should stay enabled when
> reading those bytes, unless we're in raw mode.
>

You can't read those bytes with ECC enabled. AFAICR, the controller
simply won't allow you to read the OBB region, because it's using it
for ECC.

>>
>> >
>> > It "fixes" the timeout, but I wouldn't go as far as assuring
>> > it makes the on-flash bad block scheme work, given
>> > OOB writes will still be done with ECC enabled.
>> >
>> > It seems this drivers mandates a BBT to work properly,
>> > so perhaps a better patch would be to (a) force BBT usage,
>> > and avoid OOB operations altogether, avoiding nasty
>> > timeouts and unhappy board-bringupers.
>> >
>> > Or (b) fix the READOOB and then
>> > WARN/pr_warning about OOB I/O being broken.
>> >
>> > The new driver under discussion, apparently makes AOOB
>> > actually usable (arguably, without much utility as OOB
>> > is mostly used for ECC anyway).
>> >
>> > Ideas?
>>
>> Does mandating BBT usage avoid the bug? Aren't you always going to hit
>> it on a blank NAND device straight off the production line?
>
> The timeout bug should be fixed. The fact that a valid BBT is required
> after the device has been programmed (because BBM are screwed up by the
> controller) is orthogonal to timeout on READOOB commands.
>

Right. This patch is required. I overlooked the fact that we need to
read the BBM to create the BBT in the first place.

Once the BBT is created, I don't think OOB will be needed.
-- 
Ezequiel García, VanguardiaSur
www.vanguardiasur.com.ar