Make NAND_BBT_NO_OOB_BBM configurable or let the gpmi driver decide?

Tue Mar 15 01:34:37 PDT 2022

Hi Lothar,

LW at KARO-electronics.de wrote on Tue, 15 Mar 2022 08:06:02 +0100:

> Miquel Raynal <miquel.raynal at bootlin.com> wrote:
> 
> > Hi Daniel,
> > 
> > Sorry for the delay.
> > 
> > dg at emlix.com wrote on Thu, 24 Feb 2022 19:17:43 +0100:
> >   
> > > Hi Miquel,
> > > 
> > > On 24.02.22 at 17:03, Miquel Raynal wrote:
> > > > dg at emlix.com wrote on Thu, 24 Feb 2022 16:55:27 +0100:      
> > > >> On 24.02.22 at 16:29, Miquel Raynal wrote:
> > > >>> dg at emlix.com wrote on Wed, 23 Feb 2022 11:59:02 +0100:        
> > > >>>> On 22.02.22 at 23:02, Han Xu wrote:
> > > >>>>> Could you please describe more details about what kind of
> > > >>>>> error, how to reproduce it and on which kernel version?
> > > >>>>
> > > >>>> You need a flash that has one bad block where programming the
> > > >>>> BBM sets NAND_STATUS_FAIL in its status register. The latest
> > > >>>> kernels should still have problems when this happens in a
> > > >>>> UBI.        
> > > >>>
> > > >>> I believe we should try to tackle "why" this happens more than
> > > >>> try to workaround its consequences. Can you give more details
> > > >>> about why we get this status?        
> > > >>
> > > >> Uhm, the block is bad, broken. It shows the same behavior even
> > > >> after power cycling. The other blocks are ok. I don't think it
> > > >> is our fault that it died so early.      
> > > > 
> > > > But why after a power cycle are we trying to write the BBM?      
> > > 
> > > I did not want to imply that Linux tries to write the block after
> > > every power cycle. UBI notices that the block is broken once and
> > > manages to mark it as bad in the BBT, so after power cycle it will
> > > not try to write to that block again. What I wanted to say is that
> > > manual testing of the block after power cycling shows that the
> > > block remains unusable.
> > > 
> > > The problem is that UBI switches to read-only mode after it marked
> > > the block as bad in the BBT because the redundant BBM in the OOB of
> > > the block could not be written.    
> > 
> > I think I understand your situation better now.
> > 
> > So here is our problem: why can't we write the OOB? If there is a
> > good reason this cannot happen, then we can provide the
> > NAND_BBT_NO_OOB_BBM flag. Otherwise we should find the root cause.
> >   
> > > And we don't want to get into a situation
> > > where we have to reboot the system, especially if it is because of
> > > something we don't need.
> > > 
> > > We could change nand_block_markbad_lowlevel to return success as
> > > long as updating the BBT succeeds, if you think that this is the
> > > correct approach.    
> > 
> > That is not a correct approach if we did not explicitly ask to
> > bypass writing BBMs.
> >  
> The BBM in the OOB area is a "Factory Bad Block Marker" where the
> manufacturer marks initially bad blocks. There is no guarantee that the
> BBM can be written on a block that turned bad later on.

Writing a BBM means programming one byte to 0. We don't care about the
other bytes in the page, so it does not matter if other bits flip
during this operation. Worst-case scenario: none of the bits in the
BBM get programmed. That is quite unlikely, given that it is probably
the "data" which triggered the errors in the first place, and even
less likely knowing that only the first page of the block receives the
marker, while it may not be this page which showed errors.
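
To illustrate the point about programming one byte to 0, here is a
minimal userspace sketch. It is not kernel code: the sim_* names and
the "stuck bits" parameter are hypothetical, and the rule that any
value other than 0xFF counts as a marker is an assumption of this
sketch (a real driver may apply a bit-count threshold instead).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * NAND programming can only clear bits (1 -> 0), so the new cell value
 * is the old value ANDed with the data. Bits set in 'stuck_high'
 * (hypothetical) model cells that refuse to program and stay at 1.
 */
uint8_t sim_program_byte(uint8_t cell, uint8_t data, uint8_t stuck_high)
{
	return cell & (data | stuck_high);
}

/* Assumption of this sketch: any marker byte other than 0xFF means bad. */
bool sim_block_is_bad(uint8_t marker)
{
	return marker != 0xFF;
}

/* Small self-check covering the three cases discussed above. */
void sim_bbm_selfcheck(void)
{
	/* Healthy cells: the marker is fully programmed to 0x00. */
	assert(sim_block_is_bad(sim_program_byte(0xFF, 0x00, 0x00)));
	/* Half of the bits fail to program: still reads as a marker. */
	assert(sim_block_is_bad(sim_program_byte(0xFF, 0x00, 0xF0)));
	/* Worst case: no bit programs, the marker is lost. */
	assert(!sim_block_is_bad(sim_program_byte(0xFF, 0x00, 0xFF)));
}
```

Under these assumptions, the marker is only lost when every bit of the
byte fails to program, which is the worst case mentioned above.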

Anyway, let's assume the bad block marker cannot be programmed. Why
would the raw PROGRAM PAGE operation fail? There is no read back
happening automatically. We need to understand why the NAND op failed
in the first place. I don't think it is related to the page being bad,
but rather to the specific 1-byte write that the driver tries to do. I
believe this issue is gpmi-specific.
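
For readers following the thread, the return-value policies discussed
above can be modelled in plain userspace C. This is only a sketch
under stated assumptions: the struct, the sim_* names and the error
value are hypothetical, and it is not the kernel's
nand_block_markbad_lowlevel implementation. The skip_oob_bbm field
merely stands in for the effect of NAND_BBT_NO_OOB_BBM.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_EIO 5 /* hypothetical error value used by this sketch */

struct sim_chip {
	bool skip_oob_bbm;    /* stands in for NAND_BBT_NO_OOB_BBM */
	bool oob_write_fails; /* the BBM program operation reports failure */
	bool bbt_write_fails; /* updating the BBT fails */
	bool bbm_written;     /* result: marker present in the OOB */
	bool bbt_updated;     /* result: block recorded in the BBT */
};

int sim_write_bbm(struct sim_chip *c)
{
	if (c->oob_write_fails)
		return -SIM_EIO;
	c->bbm_written = true;
	return 0;
}

int sim_update_bbt(struct sim_chip *c)
{
	if (c->bbt_write_fails)
		return -SIM_EIO;
	c->bbt_updated = true;
	return 0;
}

/* Strict policy: both steps are attempted, the first error is returned. */
int sim_markbad_strict(struct sim_chip *c)
{
	int ret = 0, res;

	if (!c->skip_oob_bbm)
		ret = sim_write_bbm(c);
	res = sim_update_bbt(c);
	if (!ret)
		ret = res;
	return ret;
}

/* Alternative raised in the thread: only the BBT result decides. */
int sim_markbad_bbt_only(struct sim_chip *c)
{
	if (!c->skip_oob_bbm)
		(void)sim_write_bbm(c); /* best effort, result ignored */
	return sim_update_bbt(c);
}

/* Self-check: a failing BBM write is fatal only under the strict policy. */
void sim_markbad_selfcheck(void)
{
	struct sim_chip a = { .oob_write_fails = true };
	struct sim_chip b = { .oob_write_fails = true };

	assert(sim_markbad_strict(&a) == -SIM_EIO && a.bbt_updated);
	assert(sim_markbad_bbt_only(&b) == 0 && b.bbt_updated);
}
```

In this model a failing BBM write makes the strict variant return an
error even though the BBT was updated, which corresponds to the
situation reported for UBI. With the flag set the OOB write is never
attempted, so the error cannot occur.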

> If a block turned BAD during use it is completely useless to try writing
> anything to it.

Not necessarily, in particular if UBI decided to turn it bad. It does
not mean the block has worn out completely, it just means that the
block is about to wear out.

> Depending on the nature of the NAND error that turned
> the block bad, trying to write that block may also affect random other
> blocks.

I don't think this can happen on SLC. And on MLC I believe it is
'correctly' handled thanks to the known pairing scheme.

Thanks,
Miquèl