Boot failed after patch "mtd: rawnand: Support for sequential cache reads"

Mon May 29 01:12:40 PDT 2023

Hi Alexander,

eagle.alexander923 at gmail.com wrote on Mon, 29 May 2023 09:10:32 +0300:

> Hello Miquel.
> 
> On Fri, 26 May 2023 at 21:14, Miquel Raynal <miquel.raynal at bootlin.com> wrote:
> > Hi Alexander,
> > eagle.alexander923 at gmail.com wrote on Thu, 25 May 2023 10:48:39 +0300:  
> > > Hello.
> > > Kernel boot fails after patch "mtd: rawnand: Support for sequential
> > > cache reads" (thanks to git bisect).
> > > Please advise what can be done here and where to look for a bug.  
> > Thanks for the report, and sorry for the trouble. Right now I don't
> > know what's wrong with the driver but as a first step, you could just
> > try to reset chip->controller->supported_op.cont_read after
> > rawnand_check_cont_read_support(). It should just avoid using the
> > optimization and solve the boot. That's of course a very early fix, we
> > now need to understand further what's going on.  
> 
> When I comment out the line "rawnand_check_cont_read_support(chip);"
> the booting works as expected.
> 
> > My first guess would be that the sequential read patterns are not
> > supported by the controller or badly implemented by its driver. But
> > that is strange given the simplicity of this controller. This
> > controller is meant to be versatile, I doubt it does not support these
> > operations. Plus, I would expect page accesses to be directly
> > implemented by the driver and not be affected by this logic. Could you
> > try to trace the actual calls which are made through the mtd layer
> > which lead to these errors? Is ->exec_op() involved in the process?
> > Where? How?  
> 
> Yes. Everything goes as expected here: debugging shows that the correct
> opcodes are passed; for NAND_CMD_READCACHESEQ it is 0x31.
> 
> > Also, what kernel are you using exactly? I'm surprised there is no
> > mtd-related error. If you reboot with an older kernel, you get your
> > data, right?  
> 
> Right. This bug appeared in Linux 6.3. For 6.2 everything worked as expected,
> so I used "git bisect" to find the point where the error occurs.
> 
> > Otherwise maybe the Micron chip is at fault, which would mean that
> > there are unsupported commands. I believed they were all standard,
> > maybe some of them are optional? Could you check in the chip datasheet
> > if there is any command used there that is unsupported?  
> 
> According to the MT29F2G08ABAEAWP datasheet, the chip supports
> the READ PAGE CACHE SEQUENTIAL opcode, but with two caveats:
>  4. These commands supported only with ECC disabled.
>  5. Issuing a READ PAGE CACHE series (31h, 00h-31h, 3Fh) command
>   when the array is busy (RDY = 1, ARDY = 0) is supported if the previous
>   command was a READ PAGE (00h-30h) or READ PAGE CACHE series
>   command; otherwise, it is prohibited.
> 
> As far as I understand, the second remark suits us, since we create
> the correct sequence.

Exactly, we do:

	READ0 (0x00), READSTART (0x30),
	READCACHESEQ (0x31), data,
	READCACHESEQ (0x31), data,
	...
	READCACHEEND (0x3f), data.

which, I believe, is what the datasheet specifies.
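
To make the ordering concrete, here is a minimal standalone sketch (not
the actual rawnand code) that lists the opcodes issued for a given number
of consecutive pages. The helper name cont_read_opcodes() and the CMD_*
macros are made up for this example; only the opcode values come from the
sequence above. Address cycles, waits and data transfers are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opcode values as listed in the sequence above (hexadecimal). */
#define CMD_READ0        0x00
#define CMD_READSTART    0x30
#define CMD_READCACHESEQ 0x31
#define CMD_READCACHEEND 0x3f

/*
 * Hypothetical helper, for illustration only: fill 'ops' with the command
 * opcodes issued to read 'npages' consecutive pages in one sequential
 * cache read. Returns the number of opcodes written, or 0 if 'ops' is
 * too small to hold them.
 */
static size_t cont_read_opcodes(unsigned int npages, uint8_t *ops, size_t cap)
{
	size_t n = 0;
	unsigned int i;

	/* A sequential cache read only makes sense for two pages or more. */
	assert(npages >= 2);

	if (cap < (size_t)npages + 2)
		return 0;

	ops[n++] = CMD_READ0;        /* followed by the address cycles */
	ops[n++] = CMD_READSTART;    /* loads the first page */

	/* One READCACHESEQ precedes the data of every page but the last. */
	for (i = 0; i < npages - 1; i++)
		ops[n++] = CMD_READCACHESEQ;

	ops[n++] = CMD_READCACHEEND; /* precedes the data of the last page */

	return n;
}
```

For three pages this yields five opcodes: one READ0, one READSTART, two
READCACHESEQ and a final READCACHEEND.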

> But the first remark can be a problem in this case.

I was not aware of this limitation; it is only mentioned in the summary,
not in the detailed description of the commands. Nice finding. We need to
prevent on-die ECC users from enabling this feature.
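
One possible shape for such a guard, as a standalone sketch under my own
assumptions rather than a patch: the enum, the helper cont_read_allowed()
and the self-check function are hypothetical names and do not exist in
the rawnand core. The idea is simply to keep the optimization only when
it is supported and the on-die engine is not in use.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of where ECC correction happens; not a kernel type. */
enum ecc_location {
	ECC_NONE,
	ECC_ON_HOST,	/* e.g. a controller-side BCH engine */
	ECC_ON_DIE,	/* the NAND chip's internal engine */
};

/*
 * Hypothetical policy helper: sequential cache reads stay enabled only
 * if they are supported at all and the on-die ECC engine is not in use.
 */
static bool cont_read_allowed(bool supported, enum ecc_location ecc)
{
	if (!supported)
		return false;

	return ecc != ECC_ON_DIE;
}

/* Small self-check showing the intended behaviour of the helper. */
static void cont_read_allowed_selfcheck(void)
{
	assert(cont_read_allowed(true, ECC_ON_HOST));
	assert(cont_read_allowed(true, ECC_NONE));
	assert(!cont_read_allowed(true, ECC_ON_DIE));
	assert(!cont_read_allowed(false, ECC_ON_HOST));
}
```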

But given the trace below, you're not using the on-die ECC engine,
right? It looks like you're using the controller's ELM engine to
perform ECC correction, so I don't see why this specific limitation
would hit us. Can you confirm the ECC engine of the chip is disabled?

> > > ...
> > > omap-gpmc 50000000.gpmc: GPMC revision 6.0
> > > ...
> > > nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xda
> > > nand: Micron MT29F2G08ABAEAWP
> > > nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
> > > nand: using OMAP_ECC_BCH8_CODE_HW ECC scheme
> > > ...
> > > VFS: Mounted root (squashfs filesystem) readonly on device 254:0.
> > > devtmpfs: mounted
> > > Freeing unused kernel image (initmem) memory: 1024K
> > > Run /sbin/init as init process
> > > SQUASHFS error: lzo decompression failed, data probably corrupt
> > > SQUASHFS error: Failed to read block 0xd291c2: -5
> > > SQUASHFS error: lzo decompression failed, data probably corrupt
> > > SQUASHFS error: Failed to read block 0xd291c2: -5
> > > SQUASHFS error: Unable to read data cache entry [d291c2]
> > > SQUASHFS error: Unable to read page, block d291c2, size 14307
> > > SQUASHFS error: Unable to read data cache entry [d291c2]
> > > SQUASHFS error: Unable to read page, block d291c2, size 14307
> > > Kernel panic - not syncing: Attempted to kill init!
> > > exitcode=0x00000007 CPU: 0 PID: 1 Comm: init Not tainted 6.3.0+ #105
> > > Hardware name: Generic AM33XX (Flattened Device Tree)
> > >  unwind_backtrace from show_stack+0xb/0xc
> > >  show_stack from dump_stack_lvl+0x2b/0x34
> > >  dump_stack_lvl from panic+0xbd/0x230
> > >  panic from make_task_dead+0x1/0x120
> > >  make_task_dead from 0xc102ca80
> > > ---[ end Kernel panic - not syncing: Attempted to kill init!
> > > exitcode=0x00000007 ]---  

Thanks,
Miquèl