Boot failed after patch "mtd: rawnand: Support for sequential cache reads"

Sun May 28 23:10:32 PDT 2023

Hello Miquel.

On Fri, 26 May 2023 at 21:14, Miquel Raynal <miquel.raynal at bootlin.com> wrote:
> Hi Alexander,
> eagle.alexander923 at gmail.com wrote on Thu, 25 May 2023 10:48:39 +0300:
> > Hello.
> > Kernel boot fails after patch "mtd: rawnand: Support for sequential
> > cache reads" (thanks to git bisect).
> > Please advise what can be done here and where to look for a bug.
> Thanks for the report, and sorry for the trouble. Right now I don't
> know what's wrong with the driver but as a first step, you could just
> try to reset chip->controller->supported_op.cont_read after
> rawnand_check_cont_read_support(). It should just avoid using the
> optimization and solve the boot. That's of course a very early fix, we
> now need to understand further what's going on.

When I comment out the call "rawnand_check_cont_read_support(chip);",
the system boots as expected.
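
For reference, the workaround suggested above (keep the check, then clear
the flag) would amount to something like the sketch below. This is not the
kernel code: the *_mock structures and mock_check_cont_read_support() are
hypothetical stand-ins so that the snippet compiles on its own. Only the
field name supported_op.cont_read is taken from the discussion.

```c
/* Simplified stand-ins for the kernel structures (assumption: only the
 * fields needed for this illustration are modelled). */
struct nand_controller_mock {
	struct {
		unsigned int cont_read : 1;
	} supported_op;
};

struct nand_chip_mock {
	struct nand_controller_mock *controller;
};

/* Hypothetical stand-in for rawnand_check_cont_read_support():
 * pretend the check concluded that continuous reads are usable. */
static void mock_check_cont_read_support(struct nand_chip_mock *chip)
{
	chip->controller->supported_op.cont_read = 1;
}

/* The workaround: run the check, then reset the flag so the
 * sequential cache read optimization is never selected. */
static void probe_with_cont_read_disabled(struct nand_chip_mock *chip)
{
	mock_check_cont_read_support(chip);
	chip->controller->supported_op.cont_read = 0;
}
```

In effect this gives the same result as commenting out the call, while
leaving the check itself in place for further debugging.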

> My first guess would be that the sequential read patterns are not
> supported by the controller or badly implemented by its driver. But
> that is strange given the simplicity of this controller. This
> controller is meant to be versatile, I doubt it does not support these
> operations. Plus, I would expect page accesses to be directly
> implemented by the driver and not be affected by this logic. Could you
> try to trace the actual calls which are made through the mtd layer
> which lead to these errors? Is ->exec_op() involved in the process?
> Where? How?

Yes. Everything goes as expected here: debugging shows that the correct
opcodes are being sent; for NAND_CMD_READCACHESEQ it is 0x31.

> Also, what kernel are you using exactly? I'm surprised there is no
> mtd-related error. If you reboot with an older kernel, you get your
> data, right?

Right. This bug appeared in Linux 6.3. With 6.2 everything worked as expected,
so I used "git bisect" to find the commit where the error was introduced.

> Otherwise maybe the Micron chip is in fault. Which would mean that
> there are unsupported commands. I believed they were all standard,
> maybe some of them are optional? Could you check in the chip datasheet
> if there is any command used there that is unsupported?

According to the MT29F2G08ABAEAWP datasheet, the chip supports
the READ PAGE CACHE SEQUENTIAL opcode, but with two caveats:
 4. These commands supported only with ECC disabled.
 5. Issuing a READ PAGE CACHE series (31h, 00h-31h, 3Fh) command
  when the array is busy (RDY = 1, ARDY = 0) is supported if the previous
  command was a READ PAGE (00h-30h) or READ PAGE CACHE series
  command; otherwise, it is prohibited.

As far as I understand, we satisfy the second caveat, since we issue
the correct command sequence.
The first caveat, however, may be the problem in this case.
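
To make the expected command order concrete, here is a small standalone
sketch of the opcode series for reading several consecutive pages, as I
read it from the datasheet note above (00h-30h, then 31h, then 3Fh). The
opcode values match the NAND_CMD_* names used in the kernel; the helper
build_seq_read_opcodes() is hypothetical and leaves out address cycles,
waits and data transfers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NAND_CMD_READ0        0x00
#define NAND_CMD_READSTART    0x30
#define NAND_CMD_READCACHESEQ 0x31
#define NAND_CMD_READCACHEEND 0x3f

/*
 * Fill 'ops' with the command opcodes for reading 'npages' consecutive
 * pages (hypothetical helper, opcodes only).
 * Returns the number of opcodes written, or 0 if npages is 0 or the
 * buffer is too small.
 */
size_t build_seq_read_opcodes(size_t npages, uint8_t *ops, size_t cap)
{
	size_t needed, n = 0, i;

	assert(ops != NULL);

	if (npages == 0)
		return 0;

	/* A single page is a plain READ PAGE: 00h-30h. */
	needed = (npages == 1) ? 2 : npages + 2;
	if (cap < needed)
		return 0;

	ops[n++] = NAND_CMD_READ0;
	ops[n++] = NAND_CMD_READSTART;

	/* One 31h before the data of every page except the last one. */
	for (i = 1; i < npages; i++)
		ops[n++] = NAND_CMD_READCACHESEQ;

	/* 3Fh before the data of the last page ends the series. */
	if (npages > 1)
		ops[n++] = NAND_CMD_READCACHEEND;

	return n;
}
```

For three pages this yields 00h, 30h, 31h, 31h, 3Fh, so every cache
command follows either a READ PAGE or another READ PAGE CACHE command,
which is what the second caveat requires.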

> > ...
> > omap-gpmc 50000000.gpmc: GPMC revision 6.0
> > ...
> > nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xda
> > nand: Micron MT29F2G08ABAEAWP
> > nand: 256 MiB, SLC, erase size: 128 KiB, page size: 2048, OOB size: 64
> > nand: using OMAP_ECC_BCH8_CODE_HW ECC scheme
> > ...
> > VFS: Mounted root (squashfs filesystem) readonly on device 254:0.
> > devtmpfs: mounted
> > Freeing unused kernel image (initmem) memory: 1024K
> > Run /sbin/init as init process
> > SQUASHFS error: lzo decompression failed, data probably corrupt
> > SQUASHFS error: Failed to read block 0xd291c2: -5
> > SQUASHFS error: lzo decompression failed, data probably corrupt
> > SQUASHFS error: Failed to read block 0xd291c2: -5
> > SQUASHFS error: Unable to read data cache entry [d291c2]
> > SQUASHFS error: Unable to read page, block d291c2, size 14307
> > SQUASHFS error: Unable to read data cache entry [d291c2]
> > SQUASHFS error: Unable to read page, block d291c2, size 14307
> > Kernel panic - not syncing: Attempted to kill init!
> > exitcode=0x00000007 CPU: 0 PID: 1 Comm: init Not tainted 6.3.0+ #105
> > Hardware name: Generic AM33XX (Flattened Device Tree)
> >  unwind_backtrace from show_stack+0xb/0xc
> >  show_stack from dump_stack_lvl+0x2b/0x34
> >  dump_stack_lvl from panic+0xbd/0x230
> >  panic from make_task_dead+0x1/0x120
> >  make_task_dead from 0xc102ca80
> > ---[ end Kernel panic - not syncing: Attempted to kill init!
> > exitcode=0x00000007 ]---