imx27: No space left to write bad block table

Mon Apr 19 14:04:05 BST 2021

Hi Miquel, Fabio,

On Mon, 2021-04-19 at 14:27 +0200, Miquel Raynal wrote:
> Hi Fabio, Guillaume,
> 
> +Stefan
> 
> Fabio Estevam <festevam at gmail.com> wrote on Mon, 19 Apr 2021 08:47:56
> -0300:
> 
> > Hi Miquel,
> > 
> > On Mon, Apr 19, 2021 at 3:37 AM Miquel Raynal <miquel.raynal at bootlin.com>
> > wrote:
> > > Hi Fabio,
> > > 
> > > Fabio Estevam <festevam at gmail.com> wrote on Sat, 17 Apr 2021 12:59:22
> > > -0300:
> > >  
> > > > Hi,
> > > > 
> > > > I noticed this error recently on an imx27-phytec-phycard-s-rdk reported
> > > > on kernelci:
> > > > 
> > > > nand: device found, Manufacturer ID: 0x20, Chip ID: 0xa1
> > > > nand: ST Micro NAND01GR3B2CZA6
> > > > nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
> > > > Bad block table not found for chip 0
> > > > Bad block table not found for chip 0
> > > > Scanning device for bad blocks
> > > > random: fast init done
> > > > Bad eraseblock 329 at 0x000002920000
> > > > Bad eraseblock 330 at 0x000002940000
> > > > Bad eraseblock 331 at 0x000002960000
> > > > Bad eraseblock 332 at 0x000002980000
> > > > Bad eraseblock 333 at 0x0000029a0000
> > > > Bad eraseblock 334 at 0x0000029c0000
> > > > Bad eraseblock 335 at 0x0000029e0000
> > > > Bad eraseblock 336 at 0x000002a00000
> > > > Bad eraseblock 337 at 0x000002a20000
> > > > Bad eraseblock 338 at 0x000002a40000
> > > > Bad eraseblock 339 at 0x000002a60000
> > > > Bad eraseblock 340 at 0x000002a80000
> > > > Bad eraseblock 341 at 0x000002aa0000
> > > > Bad eraseblock 342 at 0x000002ac0000
> > > > Bad eraseblock 343 at 0x000002ae0000
> > > > Bad eraseblock 344 at 0x000002b00000
> > > > Bad eraseblock 345 at 0x000002b20000
> > > > Bad eraseblock 1020 at 0x000007f80000
> > > > Bad eraseblock 1021 at 0x000007fa0000
> > > > Bad eraseblock 1022 at 0x000007fc0000
> > > > Bad eraseblock 1023 at 0x000007fe0000
> > > > No space left to write bad block table
> > > > nand_bbt: error while writing bad block table -28
> > > > mxc_nand: probe of d8000000.nand-controller failed with error -28
> > > > 
> > > > Full log:
> > > > https://storage.kernelci.org/next/master/next-20210416/arm/imx_v4_v5_defconfig/gcc-8/lab-pengutronix/baseline-imx27-phytec-phycard-s-rdk.html
> > > > 
> > > > I don't have access to this board but just wanted to report it.  
> > > 
> > > Thanks for the report!
> > > 
> > > Indeed that's a misbehavior, this happens when *something* is not
> > > happening correctly and the board boots over and over, each time
> > > decrementing the block supposed to contain the BBT until there are none
> > > available anymore. However I'm not sure this has been caused by a
> > > recent issue as there have not been major changes in the core nor in
> > > this driver since your last fix. Maybe this is a leftover of the
> > > previous situation. Would this be possible? Do you have a means to find
> > > out the day/kernel version which started failing?
> > 
> > I know it does not happen on master, only on linux-next.
> > 
> > The oldest linux-next log listed on kernelci for the
> > imx27-phytec-phycard-s-rdk board is 20210401, which is also affected.
> > 
> > Adding Guillaume in case kernelci could help to find the commit that
> > causes the "No space left to write bad block table" message to appear.
> 
> Interesting. Maybe I overlooked the below commit when applying. Indeed,
> BBT may be considered as bad blocks, so I wonder if the below change is
> valid now...
> 
> Guillaume, would you have a way to revert this patch on top of
> linux-next? Stefan, would you mind giving more details on the testing
> procedure?

I have tested this on an i.MX 6 by simulating two bad BBT blocks by simply
returning -EIO in nand_erase_nand when the block to be erased is one of the
first two BBT blocks.
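
For illustration only, the fault injection described above can be modelled in
userspace roughly as follows. This is a sketch, not the change I made to
nand_erase_nand: the names model_erase_block and MODEL_FAILING_BBT_BLOCKS are
hypothetical, and it assumes the BBT lives at the end of the chip so that the
"first two BBT blocks" are the last two eraseblocks.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumption: number of BBT blocks for which an erase failure is simulated. */
#define MODEL_FAILING_BBT_BLOCKS 2

/*
 * Assumption: the BBT is searched from the last block backwards, so the
 * first BBT blocks are the highest-numbered eraseblocks of the chip.
 */
static bool model_is_failing_bbt_block(int block, int total_blocks)
{
	return block >= total_blocks - MODEL_FAILING_BBT_BLOCKS &&
	       block < total_blocks;
}

/* Hypothetical stand-in for an erase routine with injected -EIO failures. */
static int model_erase_block(int block, int total_blocks)
{
	if (block < 0 || block >= total_blocks)
		return -EINVAL;

	/* Simulated wear-out: erasing these blocks always fails. */
	if (model_is_failing_bbt_block(block, total_blocks))
		return -EIO;

	return 0;
}
```

With a 1024-block chip like the one in the log, this model fails erases of
blocks 1023 and 1022 and lets every other block succeed.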

I saw this once on a customer board but was not able to reproduce it
afterwards, hence the simulation of the two bad blocks.

Without the patch below, new versions of the BBT can no longer be written to
the first two blocks reserved for the BBT, but those blocks are still read
during boot because nothing tests whether they are bad. So changes made to the
BBT after these two blocks turn bad are only kept until the next reboot, when
the old version in the two worn blocks is again used as the basis.
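
To make the scenario above concrete, here is a heavily simplified userspace
model. It is not the kernel's search_bbt: struct model_block,
model_search_bbt and model_demo are hypothetical names, and the model assumes
the worn blocks carry a bad block marker and that the search simply takes the
first block in search order that holds a BBT.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one eraseblock in the area reserved for the BBT. */
struct model_block {
	bool marked_bad;	/* assumption: worn blocks have a bad block marker */
	bool has_bbt;		/* a BBT pattern is found in the first page */
	int version;		/* version of the BBT stored in this block */
};

/*
 * Walk the reserved blocks in search order and return the index of the
 * first block holding a BBT, or -1 if there is none. With skip_bad set,
 * blocks marked bad are ignored, which is the idea behind the patch.
 */
static int model_search_bbt(const struct model_block *blk, size_t n,
			    bool skip_bad)
{
	for (size_t i = 0; i < n; i++) {
		if (skip_bad && blk[i].marked_bad)
			continue;
		if (blk[i].has_bbt)
			return (int)i;
	}
	return -1;
}

/* Return the BBT version the model would pick up at boot, or -1. */
static int model_demo(bool skip_bad)
{
	/* Two worn blocks still hold version 1; version 2 was written later. */
	const struct model_block area[] = {
		{ true,  true, 1 },
		{ true,  true, 1 },
		{ false, true, 2 },
		{ false, true, 2 },
	};
	int idx = model_search_bbt(area, sizeof(area) / sizeof(area[0]),
				   skip_bad);

	return idx < 0 ? -1 : area[idx].version;
}
```

In this model, model_demo(false) picks up the stale version 1 from a worn
block, while model_demo(true) skips the marked blocks and finds version 2.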

I tried to use the same mechanism that is used to identify bad blocks during a
scan for bad blocks. But maybe I missed something there? Or were my
assumptions wrong in the first place?

Regards,
Stefan

> 
> ---8<---
> 
> commit bd9c9fe2ad04546940f4a9979d679e62cae6aa51
> Author: Stefan Riedmueller <s.riedmueller at phytec.de>
> Date:   Thu Mar 25 11:23:37 2021 +0100
> 
>     mtd: rawnand: bbt: Skip bad blocks when searching for the BBT in NAND
>     
>     The blocks containing the bad block table can become bad as well. So
>     make sure to skip any blocks that are marked bad when searching for the
>     bad block table.
>     
>     Otherwise in very rare cases where two BBT blocks wear out it might
>     happen that an obsolete BBT is used instead of a newer available
>     version.
>     
>     Signed-off-by: Stefan Riedmueller <s.riedmueller at phytec.de>
>     Signed-off-by: Miquel Raynal <miquel.raynal at bootlin.com>
>     Link: https://lore.kernel.org/linux-mtd/20210325102337.481172-1-s.riedmueller@phytec.de
> 
> diff --git a/drivers/mtd/nand/raw/nand_bbt.c b/drivers/mtd/nand/raw/nand_bbt.c
> index dced32a126d9..6e25a5ce5ba9 100644
> --- a/drivers/mtd/nand/raw/nand_bbt.c
> +++ b/drivers/mtd/nand/raw/nand_bbt.c
> @@ -525,6 +525,7 @@ static int search_bbt(struct nand_chip *this, uint8_t *buf,
>  {
>         u64 targetsize = nanddev_target_size(&this->base);
>         struct mtd_info *mtd = nand_to_mtd(this);
> +       struct nand_bbt_descr *bd = this->badblock_pattern;
>         int i, chips;
>         int startblock, block, dir;
>         int scanlen = mtd->writesize + mtd->oobsize;
> @@ -560,6 +561,10 @@ static int search_bbt(struct nand_chip *this, uint8_t *buf,
>                         int actblock = startblock + dir * block;
>                         loff_t offs = (loff_t)actblock << this->bbt_erase_shift;
>  
> +                       /* Check if block is marked bad */
> +                       if (scan_block_fast(this, bd, offs, buf))
> +                               continue;
> +
>                         /* Read first page */
>                         scan_read(this, buf, offs, mtd->writesize, td);
>                         if (!check_pattern(buf, scanlen, mtd->writesize, td)) {
> 
> 
> Thanks,
> Miquèl