mtd: nand: raw: Possible bug in nand_onfi_detect()?

Miquel Raynal miquel.raynal at bootlin.com
Thu Mar 7 09:19:31 PST 2024


Hi Alexander,

ada at thorsis.com wrote on Thu, 7 Mar 2024 17:02:16 +0100:

> Hello Miquel,
> 
> thanks for looking into this, see my remarks below.
> 
> On Wed, Mar 06, 2024 at 04:48:31PM +0100, Miquel Raynal wrote:
> > Hi Alexander,
> > 
> > ada at thorsis.com wrote on Wed, 6 Mar 2024 15:36:04 +0100:
> >   
> > > Hello everyone,
> > > 
> > > I think I found a bug in nand_onfi_detect() which was introduced with
> > > commit c27842e7e11f ("mtd: rawnand: onfi: Adapt the parameter page
> > > read to constraint controllers") back in 2020.  
> > 
> > Interesting. I don't think this patch broke anything, as
> > constrained controllers would just not support the read_data_op() call
> > anyway.
> > 
> > That being said, I don't see why the atmel controller would
> > refuse this operation, as it is supposed to support all
> > operations without limitation. This is one of the three issues
> > you have, that probably needs fixing.  
> 
> I found a flaw in my debug messages hiding the underlying issue for
> this.  I'm afraid this is another bug introduced by you with commit
> 9f820fc0651c ("mtd: rawnand: Check the data only read pattern only
> once").  See this line in rawnand_check_data_only_read_support():
> 
>     if (!nand_read_data_op(chip, NULL, SZ_512, true, true))
> 
> This leads to nand_read_data_op() returning -EINVAL, because it checks
> if its second argument is non-NULL.

Ah, finally. Yes, this makes more sense. Someone already notified me in
private about something there. I think the contributor said he would get
back to me on it and never did, and I cannot find the thread in my
mailer anymore. Anyhow, this rings a bell, and I am now fairly convinced
the bug is real. Can you please propose a fix?
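
To make the failure concrete, here is a stand-alone user-space model of
the early argument check, not the kernel function itself. Both function
names are made up for illustration, the length test is my assumption
about what sits next to the NULL test, and the relaxed variant is only
one possible shape for a fix:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * User-space model of the argument check only; no NAND access happens.
 * Both function names are hypothetical and exist for illustration.
 */

/* Check as described in the report: a NULL buffer is always rejected. */
int model_read_data_op_current(void *buf, unsigned int len, bool check_only)
{
	(void)check_only;	/* not consulted by this check */
	if (!len || !buf)	/* the !len part is an assumption */
		return -EINVAL;
	return 0;
}

/* Assumed alternative: a buffer is required only when data is transferred. */
int model_read_data_op_relaxed(void *buf, unsigned int len, bool check_only)
{
	if (!len || (!check_only && !buf))
		return -EINVAL;
	return 0;
}
```

With buf == NULL and check_only == true, which is what the call in
rawnand_check_data_only_read_support() passes, the first variant returns
-EINVAL and the second returns 0.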

You can propose two fixes actually, one for the NULL value and another
one for mtd->writesize being unset at this stage.
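
For the second one, a rough user-space sketch of what I have in mind,
assuming we simply skip the bounds check while the geometry is still
unknown. struct model_mtd and both function names are hypothetical
stand-ins, and a real patch may well look different:

```c
#include <errno.h>

/* Hypothetical stand-in for the two struct mtd_info fields involved. */
struct model_mtd {
	unsigned int writesize;
	unsigned int oobsize;
};

/* Bounds check as quoted in the report. */
int model_column_check_current(const struct model_mtd *mtd,
			       unsigned int offset_in_page, unsigned int len)
{
	if (offset_in_page + len > mtd->writesize + mtd->oobsize)
		return -EINVAL;
	return 0;
}

/*
 * Assumed alternative: while writesize is still 0 (identification has
 * not finished), the page geometry is unknown, so do not enforce it.
 */
int model_column_check_relaxed(const struct model_mtd *mtd,
			       unsigned int offset_in_page, unsigned int len)
{
	if (mtd->writesize &&
	    offset_in_page + len > mtd->writesize + mtd->oobsize)
		return -EINVAL;
	return 0;
}
```

With the values from your report (offset_in_page 256, len 256, writesize
and oobsize both 0) the first variant rejects the request and the second
one lets it through, while a chip with known geometry is still checked.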

IIRC the original reporter told me about bitflips in his parameter page
(these cannot be produced on demand and are rather uncommon).
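
For anyone following along, here is why a bitflip (or an all-zeros
read) in the first copy matters. As I understand ONFI, each 256-byte
parameter page copy carries a CRC-16 (polynomial 0x8005, seed 0x4F4E)
over its first 254 bytes, stored little-endian in the last two bytes.
The sketch below is a user-space model with made-up helper names, not
the kernel code, and it leaves out any recovery beyond picking the
first copy whose CRC matches:

```c
#include <stddef.h>
#include <stdint.h>

#define PARAM_PAGE_LEN	256	/* size of one parameter page copy */
#define PARAM_CRC_LEN	254	/* bytes covered by the CRC */
#define ONFI_CRC_INIT	0x4F4E	/* CRC seed */

/* Bitwise CRC-16, polynomial 0x8005, MSB first. */
uint16_t model_onfi_crc16(uint16_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= (uint16_t)(*p++ << 8);
		for (int i = 0; i < 8; i++)
			crc = (uint16_t)((crc << 1) ^
					 ((crc & 0x8000) ? 0x8005 : 0));
	}
	return crc;
}

/* Test helper: write the matching CRC into the last two bytes. */
void model_seal_copy(uint8_t *copy)
{
	uint16_t crc = model_onfi_crc16(ONFI_CRC_INIT, copy, PARAM_CRC_LEN);

	copy[254] = (uint8_t)(crc & 0xff);
	copy[255] = (uint8_t)(crc >> 8);
}

/* Non-zero if the stored CRC matches the computed one. */
int model_copy_is_valid(const uint8_t *copy)
{
	uint16_t stored = (uint16_t)(copy[254] | (copy[255] << 8));

	return model_onfi_crc16(ONFI_CRC_INIT, copy, PARAM_CRC_LEN) == stored;
}

/* Index of the first copy with a matching CRC, or -1 if there is none. */
int model_first_valid_copy(const uint8_t *copies, int count)
{
	for (int i = 0; i < count; i++)
		if (model_copy_is_valid(copies + (size_t)i * PARAM_PAGE_LEN))
			return i;
	return -1;
}
```

An all-zeros copy never validates, because the non-zero seed keeps the
computed CRC non-zero. A single flipped bit in copy 0 makes the scan
fall through to copy 1, and that only works if moving to the next copy
does not fail for unrelated reasons.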

> I guess not only the atmel nand controller is affected here, but _all_
> nand controllers?  The flag can never be set, and so use_datain is
> false here?
> 
> > > Background on how I found this: I'm currently struggling getting raw
> > > nand flash access to fly with an at91 sam9x60 SoC and a S34ML02G1
> > > Spansion SLC raw NAND flash on a custom board.  The setup is
> > > comparable to the sam9x60 curiosity board and can be reproduced with
> > > that one.
> > > 
> > > NAND flash on sam9x60 curiosity board works fine with what is in
> > > mainline Linux kernel.  However after removing the line 'rb-gpios =
> > > <&pioD 5 GPIO_ACTIVE_HIGH>;' from at91-sam9x60_curiosity.dts all data
> > > read from the flash appears to be zeros only.  (I did not add that
> > > line to the dts of my custom board first, this is how I stumbled over
> > > this.)
> > > 
> > > I have no explanation for that behaviour, it should work without R/B#
> > > by reading the status register, maybe we can investigate that
> > > in depth later.  
> > 
> > I don't see why at a first look. The default is "no RB" if no property
> > is given in the DT so it should work.  
> 
> Correct, nand_soft_waitrdy() is used in that case.
> 
> > Tracing the wait ready function calls might help.  
> 
> Did that already.  On each call here the status register read contains
> E0h and nand_soft_waitrdy() returns without error, because the
> NAND_STATUS_READY flag is set.  It just looks fine, although it is
> not afterwards.

Strange. Just to be sure, how are you testing? Please do a single-page
read (minimal length with mtd_debug or any length with nanddump) to be
sure you're not affected by the continuous read bugs (also mine).

> > >  However those all-zeros data reads happen when
> > > reading the ONFI param page as well as data read from OOB/spare area
> > > later and I bet it's the same with usual data.  
> > 
> > Reading data without observing tWB + tR may lead to this.  
> 
> I already suspected some timing issue.  Deeper investigation will have
> to wait until we have soldered some wires to the chip and connected a logic
> analyzer however.  At least that's the plan, but this will have to
> wait some days until after I finished some other tasks.

Sure.

> 
> > > This read error reveals a bug in nand_onfi_detect().  After setting
> > > up some things there's this for loop:
> > > 
> > >     for (i = 0; i < ONFI_PARAM_PAGES; i++) {
> > > 
> > > For i = 0 nand_read_param_page_op() is called and in my case all zeros
> > > are returned and thus the CRC calculated does not match the all zeros
> > > CRC read.  So the usual break on successful reading the first page is
> > > skipped and for reading the second page nand_change_read_column_op()
> > > is called.  I think that one always fails on this line:
> > > 
> > >     if (offset_in_page + len > mtd->writesize + mtd->oobsize) {
> > > 
> > > Those variables contain the following values:
> > > 
> > >     offset_in_page: 256
> > >     len: 256
> > >     mtd->writesize: 0
> > >     mtd->oobsize: 0  
> > 
> > Indeed. We probably need some kind of extra check that does not perform
> > the if clause above if !mtd->writesize.
> >   
> > > The condition is true and nand_change_read_column_op() returns with
> > > -EINVAL, because mtd->writesize and mtd->oobsize are not set yet in
> > > that code path.  Those are probably initialized later, maybe with
> > > parameters read from that ONFI param page?
> > > 
> > > Returning with error from nand_change_read_column_op() leads to
> > > jumping out of nand_onfi_detect() early, and no ONFI param page is
> > > evaluated at all, although the second or third page could be intact.
> > > 
> > > I guess this would also fail with any other reason for not matching
> > > CRCs in the first page, but I have no faulty NAND flash chip to
> > > confirm that.  
> > 
> > Thanks for the whole report, it is interesting and should lead to fixes:
> > - why does the controller refuse the datain op?  
> 
> See above.
> 
> > - why nand_soft_waitrdy is not enough?  
> 
> I don't know.  That's one reason I asked here.
> 
> > - changing the condition in nand_change_read_column_op()
> > 
> > Can you take care of these?  
> 
> The last one probably after in depth reading of the code again, unsure
> for the other two.

First one is "easy" now I guess?

For the middle one we need more investigation of course.

Thanks for the debugging and sorry for the troubles.

Miquèl


