imx27: No space left to write bad block table

Mon Apr 26 16:53:39 BST 2021

Hi Miquel,

On Mon, 2021-04-19 at 17:36 +0200, Miquel Raynal wrote:
> Hi Stefan,
> 
> > > Interesting. Maybe I overlooked the below commit when applying. Indeed,
> > > BBT may be considered as bad blocks, so I wonder if the below change is
> > > valid now...
> > > 
> > > Guillaume, would you have a way to revert this patch on top of
> > > linux-next? Stefan, would you mind giving more details on the testing
> > > procedure?  
> > 
> > I have tested this on an i.MX 6 by simulating two bad BBT blocks by simply
> > returning -EIO in nand_erase_nand when the block to be erased is one of the
> > first two BBT blocks.
> > 
> > I have seen this once on a customer board but was not able to reproduce it
> > anymore, thus the simulation of the two bad blocks.
> > 
> > Without the patch below new versions of the BBT can no longer be written to
> > the first two blocks reserved for the BBT but they are still evaluated to read
> > the BBT from during boot due to the lack of a test if these blocks are bad. So
> > changes to the BBT after these two blocks turn bad are only kept and used
> > until the next reboot where again the old version of the two worn blocks is
> > used as a basis.
> > 
> > I tried to use the same mechanism that is used to identify bad blocks during a
> > scan for bad blocks. But maybe I missed something there? Or were my
> > assumptions wrong in the first place?
> 
> Honestly I don't know what is wrong exactly in this patch.
> 
> We will revert the commit as it clearly breaks something fundamental
> and the merge window is too close to adopt a hackish attitude.
> 
> I would propose the following tests with your board:
> - Hack the core to allow yourself to access bad blocks from userspace
>   for testing purposes.
> - With the below commit, you should have the same behavior as
>   reported by Fabio.

On my imx6 board the patch does not lead to the behavior reported by Fabio.
The BBT is found and can be read:

[    1.520501] nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xd3
[    1.526944] nand: Macronix MX60LF8G18AC
[    1.530803] nand: 1024 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
[    1.539412] Bad block table found at page 524224, version 0x01
[    1.545790] Bad block table found at page 524160, version 0x01
[    1.551796] nand_read_bbt: bad block at 0x000001b60000
[    1.557032] nand_read_bbt: bad block at 0x000008cc0000
[    1.562204] nand_read_bbt: bad block at 0x00000f480000
[    1.567395] nand_read_bbt: bad block at 0x0000111c0000
[    1.572588] nand_read_bbt: bad block at 0x0000205c0000
[    1.577802] nand_read_bbt: bad block at 0x00002dfc0000

I dug a little deeper and I think I found the cause for the failure on the
imx27 board.

The mxc_nand driver (used by the imx27) uses its own nand_bbt_descr with an
offset of 0 in the OOB area. This is the same place the bad block marker is
located on worn or factory bad blocks.

This explains why the BBT is no longer found with my patch. scan_block_fast
checks whether the bad block marker contains anything other than 0xff and finds
the 'B' from 'Bbt0'. The same happens for the mirrored version, where it finds
the '1' from '1tbB'.
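
To make the collision concrete, here is a minimal user-space sketch (not kernel
code). It assumes, as described above, that the bad block marker and the BBT
signature both start at OOB offset 0. marker_says_bad and make_bbt_oob are
hypothetical helpers that only mimic the idea of the check; they are not the
real scan_block_fast.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OOB_SIZE 64

/*
 * Simplified stand-in for the bad block marker test: a block counts as
 * bad when the marker byte at marker_offs is anything other than 0xff.
 */
static int marker_says_bad(const uint8_t *oob, int marker_offs)
{
	return oob[marker_offs] != 0xff;
}

/* Start from an erased OOB area and place a 4-byte signature at sig_offs. */
static void make_bbt_oob(uint8_t *oob, const char *sig, int sig_offs)
{
	assert(sig_offs >= 0 && sig_offs + 4 <= OOB_SIZE);
	memset(oob, 0xff, OOB_SIZE);
	memcpy(oob + sig_offs, sig, 4);
}
```

With the signature at offset 0, marker_says_bad(oob, 0) returns 1 for both
"Bbt0" and "1tbB", because 'B' and '1' land on the marker byte. An erased OOB
area (all 0xff) returns 0.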

This also explains why the blocks holding the original BBT are detected as bad
in the scan that runs once no BBT has been found, which results in the BBT
being written to the remaining two blocks reserved for the BBT.

19:38:23.001385  nand: device found, Manufacturer ID: 0x20, Chip ID: 0xa1
19:38:23.002635  nand: ST Micro NAND01GR3B2CZA6
19:38:23.006666  nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
19:38:23.028413  Bad block table not found for chip 0
19:38:23.035625  random: fast init done
19:38:23.049144  Bad block table not found for chip 0
19:38:23.050024  Scanning device for bad blocks
19:38:23.330999  Bad eraseblock 329 at 0x000002920000
19:38:23.345958  Bad eraseblock 330 at 0x000002940000
19:38:23.356024  Bad eraseblock 331 at 0x000002960000
19:38:23.365738  Bad eraseblock 332 at 0x000002980000
19:38:23.375590  Bad eraseblock 333 at 0x0000029a0000
19:38:23.385505  Bad eraseblock 334 at 0x0000029c0000
19:38:23.395548  Bad eraseblock 335 at 0x0000029e0000
19:38:23.405501  Bad eraseblock 336 at 0x000002a00000
19:38:23.415551  Bad eraseblock 337 at 0x000002a20000
19:38:23.425937  Bad eraseblock 338 at 0x000002a40000
19:38:23.436028  Bad eraseblock 339 at 0x000002a60000
19:38:23.445959  Bad eraseblock 340 at 0x000002a80000
19:38:23.456008  Bad eraseblock 341 at 0x000002aa0000
19:38:23.466006  Bad eraseblock 342 at 0x000002ac0000
19:38:23.475912  Bad eraseblock 343 at 0x000002ae0000
19:38:23.486064  Bad eraseblock 344 at 0x000002b00000
19:38:23.495925  Bad eraseblock 345 at 0x000002b20000
19:38:24.048053  Bad eraseblock 1022 at 0x000007fc0000
19:38:24.056117  Bad eraseblock 1023 at 0x000007fe0000
19:38:24.067953  Bad block table written to 0x000007fa0000, version 0x01
19:38:24.087637  Bad block table written to 0x000007f80000, version 0x01

On the next boot all four BBT versions in flash are skipped for the same reason
as before, and the two blocks containing the latest BBT are also detected as
bad blocks. As a result, no blocks remain to write the BBT to.
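
The sequence over the two boots can be modelled with a small user-space sketch.
This is a deliberately simplified model, not the kernel's block selection
logic: it assumes four reserved blocks at the end of a 1024-block chip, that no
BBT is ever found, and that every block carrying a BBT signature is scanned as
bad. boot_once and block_offset are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS     1024 /* 128 MiB / 128 KiB erase size, as in the log */
#define ERASE_SHIFT    17   /* 128 KiB erase blocks */
#define BBT_MAX_BLOCKS 4    /* assumed number of blocks reserved for the BBT */

/* Byte offset of an erase block, matching the addresses printed in the log. */
static uint64_t block_offset(int block)
{
	return (uint64_t)block << ERASE_SHIFT;
}

/*
 * Model one boot. Walk the reserved blocks from the end of the chip and
 * write two copies (main and mirror) to the first two blocks that do not
 * already hold a BBT signature, since those are treated as bad.
 * Returns 0 on success or -ENOSPC when fewer than two blocks are usable.
 */
static int boot_once(bool holds_bbt[NUM_BLOCKS], int written_to[2])
{
	int copies = 0;

	for (int i = 0; i < BBT_MAX_BLOCKS && copies < 2; i++) {
		int block = NUM_BLOCKS - 1 - i;

		if (holds_bbt[block])
			continue; /* signature sits on the marker byte */
		holds_bbt[block] = true;
		written_to[copies++] = block;
	}
	return copies == 2 ? 0 : -ENOSPC;
}
```

Starting with signatures in blocks 1022 and 1023, the first call writes to
blocks 1021 and 1020 (0x7fa0000 and 0x7f80000). The second call finds all four
reserved blocks occupied and returns -ENOSPC, which corresponds to the -28 in
the log.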

21:22:55.032595  nand: device found, Manufacturer ID: 0x20, Chip ID: 0xa1
21:22:55.033333  nand: ST Micro NAND01GR3B2CZA6
21:22:55.037804  nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
21:22:55.088475  Bad block table not found for chip 0
21:22:55.093807  Bad block table not found for chip 0
21:22:55.105995  Scanning device for bad blocks
21:22:55.109049  random: fast init done
21:22:55.395488  Bad eraseblock 329 at 0x000002920000
21:22:55.406832  Bad eraseblock 330 at 0x000002940000
21:22:55.416885  Bad eraseblock 331 at 0x000002960000
21:22:55.426736  Bad eraseblock 332 at 0x000002980000
21:22:55.436732  Bad eraseblock 333 at 0x0000029a0000
21:22:55.446864  Bad eraseblock 334 at 0x0000029c0000
21:22:55.456662  Bad eraseblock 335 at 0x0000029e0000
21:22:55.466785  Bad eraseblock 336 at 0x000002a00000
21:22:55.476801  Bad eraseblock 337 at 0x000002a20000
21:22:55.486772  Bad eraseblock 338 at 0x000002a40000
21:22:55.496768  Bad eraseblock 339 at 0x000002a60000
21:22:55.506607  Bad eraseblock 340 at 0x000002a80000
21:22:55.516965  Bad eraseblock 341 at 0x000002aa0000
21:22:55.526621  Bad eraseblock 342 at 0x000002ac0000
21:22:55.536702  Bad eraseblock 343 at 0x000002ae0000
21:22:55.546660  Bad eraseblock 344 at 0x000002b00000
21:22:55.556745  Bad eraseblock 345 at 0x000002b20000
21:22:56.172928  Bad eraseblock 1020 at 0x000007f80000
21:22:56.187043  Bad eraseblock 1021 at 0x000007fa0000
21:22:56.197437  Bad eraseblock 1022 at 0x000007fc0000
21:22:56.212665  Bad eraseblock 1023 at 0x000007fe0000
21:22:56.213356  No space left to write bad block table
21:22:56.215012  nand_bbt: error while writing bad block table -28
21:22:56.239353  mxc_nand: probe of d8000000.nand-controller failed with error -28

I'm not sure what the best way to address this issue is. A few ideas came to
mind:

- Shift the offset of the nand_bbt_descr of mxc_nand to make room for the bad
block marker. I'm not sure whether this would conflict with the ECC hardware,
but the ooblayout functions suggest that it could work.

---8<---
static int mxc_v1_ooblayout_free(struct mtd_info *mtd, int section,
                                 struct mtd_oob_region *oobregion)
{   
        struct nand_chip *nand_chip = mtd_to_nand(mtd);

        if (section > nand_chip->ecc.steps)
                return -ERANGE;

        if (!section) {
                if (mtd->writesize <= 512) {
                        oobregion->offset = 0;
                        oobregion->length = 5;
                } else {
                        oobregion->offset = 2;
                        oobregion->length = 4;
                }
        } else {
                oobregion->offset = ((section - 1) * 16) + MXC_V1_ECCBYTES + 6;
                if (section < nand_chip->ecc.steps)
                        oobregion->length = (section * 16) + 6 -
                                            oobregion->offset;
                else
                        oobregion->length = mtd->oobsize - oobregion->offset;
        }   

        return 0;
}
---8<---

Unfortunately I don't have any hardware at hand at the moment to test it. I
think the distinction between small and large page sizes needs to be reflected
in the bbt_descr as well.

- Use NAND_BBT_NO_OOB with the mxc_nand driver, since there is a comment saying
there is an overlap between the generic bbt descriptors and the ECC hardware.
I'm not sure what other effects setting NAND_BBT_NO_OOB might have.

- Explicitly check for the bad block marker during a search for the BBT
instead of using scan_block_fast.
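
As a rough illustration of the first idea, the following user-space sketch
moves the signature to OOB offset 2, the start of the free region that
mxc_v1_ooblayout_free reports for section 0 on large pages. It is untested
against real hardware. struct bbt_sig is a hypothetical, cut-down stand-in for
nand_bbt_descr, and the version byte a real descriptor also stores is left out;
it would need room in the OOB area as well.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OOB_SIZE 64

/* Hypothetical, simplified descriptor: only the fields this sketch needs. */
struct bbt_sig {
	int offs;            /* OOB offset where the signature starts */
	int len;             /* signature length in bytes */
	const char *pattern; /* e.g. "Bbt0" or "1tbB" */
};

/* Bad block marker test, assuming the marker lives at OOB offset 0. */
static int marker_says_bad(const uint8_t *oob)
{
	return oob[0] != 0xff;
}

/* Return 1 when the signature is present at the descriptor's offset. */
static int sig_matches(const uint8_t *oob, const struct bbt_sig *sig)
{
	return memcmp(oob + sig->offs, sig->pattern, sig->len) == 0;
}

/* Start from an erased OOB area and store the signature. */
static void write_sig(uint8_t *oob, const struct bbt_sig *sig)
{
	assert(sig->offs >= 0 && sig->offs + sig->len <= OOB_SIZE);
	memset(oob, 0xff, OOB_SIZE);
	memcpy(oob + sig->offs, sig->pattern, sig->len);
}
```

With a descriptor of { 2, 4, "Bbt0" } the marker byte stays 0xff, so
marker_says_bad returns 0 while sig_matches still finds the table. With
{ 0, 4, "Bbt0" } the same block is reported as bad. Existing tables written at
offset 0 would no longer match the shifted descriptor, so some migration path
would be needed.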

Any suggestions?

Regards,
Stefan

> - Revert the commit.
> - Manually change the bad block markers (nanddump, flash_erase,
>   nandwrite) to declare the two tables bad. Reboot and observe if there
>   are any issues. You can try to work from there.
> 
> > > ---8<---
> > > 
> > > commit bd9c9fe2ad04546940f4a9979d679e62cae6aa51
> > > Author: Stefan Riedmueller <s.riedmueller at phytec.de>
> > > Date:   Thu Mar 25 11:23:37 2021 +0100
> > > 
> > >     mtd: rawnand: bbt: Skip bad blocks when searching for the BBT in NAND
> > >     
> > >     The blocks containing the bad block table can become bad as well. So
> > >     make sure to skip any blocks that are marked bad when searching for the
> > >     bad block table.
> > >     
> > >     Otherwise in very rare cases where two BBT blocks wear out it might
> > >     happen that an obsolete BBT is used instead of a newer available
> > >     version.
> > >     
> > >     Signed-off-by: Stefan Riedmueller <s.riedmueller at phytec.de>
> > >     Signed-off-by: Miquel Raynal <miquel.raynal at bootlin.com>
> > >     Link: https://lore.kernel.org/linux-mtd/20210325102337.481172-1-s.riedmueller@phytec.de
> > > 
> > > diff --git a/drivers/mtd/nand/raw/nand_bbt.c b/drivers/mtd/nand/raw/nand_bbt.c
> > > index dced32a126d9..6e25a5ce5ba9 100644
> > > --- a/drivers/mtd/nand/raw/nand_bbt.c
> > > +++ b/drivers/mtd/nand/raw/nand_bbt.c
> > > @@ -525,6 +525,7 @@ static int search_bbt(struct nand_chip *this, uint8_t *buf,
> > >  {
> > >         u64 targetsize = nanddev_target_size(&this->base);
> > >         struct mtd_info *mtd = nand_to_mtd(this);
> > > +       struct nand_bbt_descr *bd = this->badblock_pattern;
> > >         int i, chips;
> > >         int startblock, block, dir;
> > >         int scanlen = mtd->writesize + mtd->oobsize;
> > > @@ -560,6 +561,10 @@ static int search_bbt(struct nand_chip *this, uint8_t *buf,
> > >                         int actblock = startblock + dir * block;
> > >                         loff_t offs = (loff_t)actblock << this->bbt_erase_shift;
> > >  
> > > +                       /* Check if block is marked bad */
> > > +                       if (scan_block_fast(this, bd, offs, buf))
> > > +                               continue;
> > > +
> > >                         /* Read first page */
> > >                         scan_read(this, buf, offs, mtd->writesize, td);
> > >                         if (!check_pattern(buf, scanlen, mtd->writesize, td)) {
> > > 
> > > 
> > > Thanks,
> > > Miquèl  
> 
> Thanks,
> Miquèl