imx27: No space left to write bad block table

Mon May 10 09:38:23 BST 2021

Hi Miquel,

On Tue, 2021-05-04 at 10:34 +0200, Miquel Raynal wrote:
> Hi Stefan,
> 
> Stefan Riedmüller <S.Riedmueller at phytec.de> wrote on Mon, 26 Apr 2021 15:53:39 +0000:
> 
> > Hi Miquel,
> > 
> > On Mon, 2021-04-19 at 17:36 +0200, Miquel Raynal wrote:
> > > Hi Stefan,
> > >   
> > > > > Interesting. Maybe I overlooked the below commit when applying.
> > > > > Indeed, BBT may be considered as bad blocks, so I wonder if the
> > > > > below change is valid now...
> > > > > 
> > > > > Guillaume, would you have a way to revert this patch on top of
> > > > > linux-next? Stefan, would you mind giving more details on the
> > > > > testing procedure?
> > > > 
> > > > I have tested this on an i.MX 6 by simulating two bad BBT blocks by
> > > > simply returning -EIO in nand_erase_nand when the block to be erased
> > > > is one of the first two BBT blocks.
> > > > 
> > > > I have seen this once on a customer board but was not able to
> > > > reproduce it anymore, thus the simulation of the two bad blocks.
> > > > 
> > > > Without the patch below, new versions of the BBT can no longer be
> > > > written to the first two blocks reserved for the BBT, but these blocks
> > > > are still used to read the BBT during boot because there is no test
> > > > for whether they are bad. So changes to the BBT after these two blocks
> > > > turn bad are only kept and used until the next reboot, where the old
> > > > version from the two worn blocks is again used as a basis.
> > > > 
> > > > I tried to use the same mechanism that is used to identify bad blocks
> > > > during a scan for bad blocks. But maybe I missed something there? Or
> > > > were my assumptions wrong in the first place?
> > > 
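The fault injection described above can be modelled outside the kernel roughly as follows. This is only a sketch: `struct sim_chip`, `is_primary_bbt_block()` and `sim_erase_block()` are hypothetical names, not kernel symbols, and it assumes the BBT is placed at the end of the chip (as the logs later in this mail suggest), so that the "first two BBT blocks" are the last two eraseblocks. The actual test was a conditional `return -EIO` inside nand_erase_nand.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified chip description (not the kernel's struct nand_chip). */
struct sim_chip {
	uint64_t size;       /* total chip size in bytes */
	uint32_t erase_size; /* eraseblock size in bytes */
};

/*
 * Assumption: the BBT is searched from the end of the chip, so the first
 * two BBT candidates are the last two eraseblocks.
 */
static bool is_primary_bbt_block(const struct sim_chip *chip, uint32_t block)
{
	uint32_t nblocks;

	assert(chip->erase_size != 0);
	nblocks = (uint32_t)(chip->size / chip->erase_size);

	return nblocks >= 2 && block >= nblocks - 2 && block < nblocks;
}

/*
 * Model of an erase path with injected failures: returns -EIO for the two
 * primary BBT blocks and 0 (success) for every other block.
 */
static int sim_erase_block(const struct sim_chip *chip, uint32_t block)
{
	if (is_primary_bbt_block(chip, block))
		return -EIO;

	return 0;
}
```

With a 1024 MiB chip and 128 KiB eraseblocks (8192 blocks), this makes erases of blocks 8190 and 8191 fail while all other blocks erase normally.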
> > > Honestly I don't know what is wrong exactly in this patch.
> > > 
> > > We will revert the commit as it clearly breaks something fundamental
> > > and the merge window is too close to adopt a hackish attitude.
> > > 
> > > I would propose the following tests with your board:
> > > - Hack the core to allow yourself to access bad blocks from userspace
> > >   for testing purposes.
> > > - With the below commit, you should have the same behavior as
> > >   reported by Fabio.
> > 
> > On my imx6 board the patch does not lead to the behavior reported by
> > Fabio. The BBT is found and can be read:
> > 
> > [    1.520501] nand: device found, Manufacturer ID: 0xc2, Chip ID: 0xd3
> > [    1.526944] nand: Macronix MX60LF8G18AC
> > [    1.530803] nand: 1024 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
> > [    1.539412] Bad block table found at page 524224, version 0x01
> > [    1.545790] Bad block table found at page 524160, version 0x01
> > [    1.551796] nand_read_bbt: bad block at 0x000001b60000
> > [    1.557032] nand_read_bbt: bad block at 0x000008cc0000
> > [    1.562204] nand_read_bbt: bad block at 0x00000f480000
> > [    1.567395] nand_read_bbt: bad block at 0x0000111c0000
> > [    1.572588] nand_read_bbt: bad block at 0x0000205c0000
> > [    1.577802] nand_read_bbt: bad block at 0x00002dfc0000
> > 
> > I dug a little deeper and I think I found the cause for the failure on the
> > imx27 board.
> > 
> > The mxc_nand driver (used by the imx27) uses its own nand_bbt_descr with
> > an offset of 0 in the OOB area. This is the same place the bad block
> > marker is located on worn or factory bad blocks.
> > 
> > This explains why the BBT is no longer found with my patch.
> > scan_block_fast checks if there is anything other than 0xff in the bad
> > block marker and finds the 'B' from 'Bbt0'. The same occurs for the
> > mirrored version, where it finds the '1' from '1tbB'.
> 
> Ok, that's the reason why the original logic failed, thanks for looking
> for it.
> 
> > This also explains why the original BBT is detected as bad blocks in the
> > scan after the BBT was not found, which results in the BBT being written
> > to the remaining two blocks reserved for the BBT.
> > 
> > 19:38:23.001385  nand: device found, Manufacturer ID: 0x20, Chip ID: 0xa1
> > 19:38:23.002635  nand: ST Micro NAND01GR3B2CZA6
> > 19:38:23.006666  nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
> > 19:38:23.028413  Bad block table not found for chip 0
> > 19:38:23.035625  random: fast init done
> > 19:38:23.049144  Bad block table not found for chip 0
> > 19:38:23.050024  Scanning device for bad blocks
> > 19:38:23.330999  Bad eraseblock 329 at 0x000002920000
> > 19:38:23.345958  Bad eraseblock 330 at 0x000002940000
> > 19:38:23.356024  Bad eraseblock 331 at 0x000002960000
> > 19:38:23.365738  Bad eraseblock 332 at 0x000002980000
> > 19:38:23.375590  Bad eraseblock 333 at 0x0000029a0000
> > 19:38:23.385505  Bad eraseblock 334 at 0x0000029c0000
> > 19:38:23.395548  Bad eraseblock 335 at 0x0000029e0000
> > 19:38:23.405501  Bad eraseblock 336 at 0x000002a00000
> > 19:38:23.415551  Bad eraseblock 337 at 0x000002a20000
> > 19:38:23.425937  Bad eraseblock 338 at 0x000002a40000
> > 19:38:23.436028  Bad eraseblock 339 at 0x000002a60000
> > 19:38:23.445959  Bad eraseblock 340 at 0x000002a80000
> > 19:38:23.456008  Bad eraseblock 341 at 0x000002aa0000
> > 19:38:23.466006  Bad eraseblock 342 at 0x000002ac0000
> > 19:38:23.475912  Bad eraseblock 343 at 0x000002ae0000
> > 19:38:23.486064  Bad eraseblock 344 at 0x000002b00000
> > 19:38:23.495925  Bad eraseblock 345 at 0x000002b20000
> > 19:38:24.048053  Bad eraseblock 1022 at 0x000007fc0000
> > 19:38:24.056117  Bad eraseblock 1023 at 0x000007fe0000
> > 19:38:24.067953  Bad block table written to 0x000007fa0000, version 0x01
> > 19:38:24.087637  Bad block table written to 0x000007f80000, version 0x01
> > 
> > 
> > On the next boot all four BBT versions in flash are skipped for the same
> > reason as before, and the two blocks containing the latest BBT are also
> > detected as bad blocks. The result is that no blocks remain to write the
> > BBT to.
> > 
> > 
> > 21:22:55.032595  nand: device found, Manufacturer ID: 0x20, Chip ID: 0xa1
> > 21:22:55.033333  nand: ST Micro NAND01GR3B2CZA6
> > 21:22:55.037804  nand: 128 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
> > 21:22:55.088475  Bad block table not found for chip 0
> > 21:22:55.093807  Bad block table not found for chip 0
> > 21:22:55.105995  Scanning device for bad blocks
> > 21:22:55.109049  random: fast init done
> > 21:22:55.395488  Bad eraseblock 329 at 0x000002920000
> > 21:22:55.406832  Bad eraseblock 330 at 0x000002940000
> > 21:22:55.416885  Bad eraseblock 331 at 0x000002960000
> > 21:22:55.426736  Bad eraseblock 332 at 0x000002980000
> > 21:22:55.436732  Bad eraseblock 333 at 0x0000029a0000
> > 21:22:55.446864  Bad eraseblock 334 at 0x0000029c0000
> > 21:22:55.456662  Bad eraseblock 335 at 0x0000029e0000
> > 21:22:55.466785  Bad eraseblock 336 at 0x000002a00000
> > 21:22:55.476801  Bad eraseblock 337 at 0x000002a20000
> > 21:22:55.486772  Bad eraseblock 338 at 0x000002a40000
> > 21:22:55.496768  Bad eraseblock 339 at 0x000002a60000
> > 21:22:55.506607  Bad eraseblock 340 at 0x000002a80000
> > 21:22:55.516965  Bad eraseblock 341 at 0x000002aa0000
> > 21:22:55.526621  Bad eraseblock 342 at 0x000002ac0000
> > 21:22:55.536702  Bad eraseblock 343 at 0x000002ae0000
> > 21:22:55.546660  Bad eraseblock 344 at 0x000002b00000
> > 21:22:55.556745  Bad eraseblock 345 at 0x000002b20000
> > 21:22:56.172928  Bad eraseblock 1020 at 0x000007f80000
> > 21:22:56.187043  Bad eraseblock 1021 at 0x000007fa0000
> > 21:22:56.197437  Bad eraseblock 1022 at 0x000007fc0000
> > 21:22:56.212665  Bad eraseblock 1023 at 0x000007fe0000
> > 21:22:56.213356  No space left to write bad block table
> > 21:22:56.215012  nand_bbt: error while writing bad block table -28
> > 21:22:56.239353  mxc_nand: probe of d8000000.nand-controller failed with error -28
> > 
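As a sanity check on the two logs above: dividing each reported offset by the 128 KiB erase size gives the eraseblock index, and a 128 MiB chip has 1024 such blocks, so blocks 1020 to 1023 are the last four blocks of the chip. That matches four blocks being reserved for the BBT, as described above. The helper names below are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a byte offset on the chip to an eraseblock index. */
static uint32_t offset_to_block(uint64_t offset, uint32_t erase_size)
{
	assert(erase_size != 0);
	return (uint32_t)(offset / erase_size);
}

/* Number of eraseblocks on a chip of the given size. */
static uint32_t total_blocks(uint64_t chip_size, uint32_t erase_size)
{
	assert(erase_size != 0);
	return (uint32_t)(chip_size / erase_size);
}
```

For the imx27 board (128 MiB, 128 KiB eraseblocks), offset 0x000007f80000 maps to block 1020 and 0x000007fe0000 to block 1023, out of 1024 blocks in total. Once all four of these are flagged as bad, nothing in the reserved area is left for a new table, hence the -28 (ENOSPC) at probe time.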
> > I'm not sure of the best way to address this issue. A few ideas came to
> > mind:
> > 
> > - Shift the offset of the nand_bbt_descr of mxc_nand to make room for the
> > bad block marker. I'm not sure whether this would conflict with the ECC
> > hardware, although the ooblayout functions suggest that it could work.
> 
> There are thousands of boards out there that would be broken with such
> a change: it's too late to do changes in this driver, unfortunately.
> 
> > Unfortunately I don't have any hardware at hand at the moment to test it.
> > I think the distinction between small and large page sizes needs to be
> > reflected in the bbt_descr as well.
> > 
> > - Use NAND_BBT_NO_OOB with the mxc_nand driver, since there is a comment
> > saying there is an overlap between the generic bbt descriptors and the
> > ECC hardware. I'm not sure what other effects setting NAND_BBT_NO_OOB
> > might have.
> 
> Same here: that's not an option.
> 
> > - Explicitly check for the bad block marker during a search for the BBT
> > instead of using scan_block_fast
> 
> This looks more reasonable. You can create a helper which does the
> scan_block_fast(), then eventually checks the beginning of the OOB
> buffer and tries to match with the ->td and ->md descriptors. This
> should work with all the legacy drivers implementing their own
> descriptors - hopefully.
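A rough userspace sketch of such a helper, to illustrate the idea. It is not kernel code: `struct bbt_pattern`, `oob_matches_pattern()`, `marker_says_bad()` and `bbt_candidate_is_bad()` are hypothetical names, and the two descriptors only model the offset-0 'Bbt0' and '1tbB' patterns discussed above. The point is that a marker byte other than 0xff only counts as "bad" if the OOB does not start with one of the BBT patterns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the pattern part of a BBT descriptor. */
struct bbt_pattern {
	size_t offs;            /* offset of the pattern inside the OOB area */
	size_t len;             /* pattern length in bytes */
	const uint8_t *pattern; /* expected bytes */
};

static const uint8_t main_pat[]   = { 'B', 'b', 't', '0' };
static const uint8_t mirror_pat[] = { '1', 't', 'b', 'B' };

/* Both patterns at OOB offset 0, as described for mxc_nand in this thread. */
static const struct bbt_pattern td = { 0, sizeof(main_pat), main_pat };
static const struct bbt_pattern md = { 0, sizeof(mirror_pat), mirror_pat };

static bool oob_matches_pattern(const uint8_t *oob, size_t oob_len,
				const struct bbt_pattern *p)
{
	if (p->offs + p->len > oob_len)
		return false;

	return memcmp(oob + p->offs, p->pattern, p->len) == 0;
}

/* Naive check: any value other than 0xff in the marker byte means "bad". */
static bool marker_says_bad(const uint8_t *oob, size_t oob_len,
			    size_t marker_offs)
{
	assert(oob != NULL);
	return marker_offs < oob_len && oob[marker_offs] != 0xff;
}

/*
 * Check for BBT candidate blocks: the marker only counts if the OOB does
 * not carry one of the BBT patterns.
 */
static bool bbt_candidate_is_bad(const uint8_t *oob, size_t oob_len,
				 size_t marker_offs)
{
	if (!marker_says_bad(oob, oob_len, marker_offs))
		return false;

	if (oob_matches_pattern(oob, oob_len, &td) ||
	    oob_matches_pattern(oob, oob_len, &md))
		return false;

	return true;
}
```

With an OOB buffer starting with 'Bbt0', `marker_says_bad()` reports a bad block while `bbt_candidate_is_bad()` does not. One caveat of this approach: a block that wore out while still holding a BBT pattern would pass this check, so it would have to be combined with whatever else marks BBT blocks as unusable.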

Thanks for your input. I will take another spin at it.

> 
> Other drivers are impacted as well, so maybe you'll find a board for
> testing (or someone kind enough to test it for you).

I hope I'll get my hands on at least one of the imx27 boards.

Thanks,
Stefan

> 
> Thanks,
> Miquèl