[PATCH] ubi: wl: Close down wear-leveling before nand is suspended

Sun Sep 8 20:20:45 PDT 2024

On 2024/9/8 3:28, Mårten Lindahl wrote:
> If a reboot/shutdown signal with double force (-ff) is triggered when
> the erase worker or wear-leveling worker function runs we may end up in
> a race condition since the MTD device gets a reboot notification and
> suspends the nand flash before the erase or wear-leveling is done. This
> will reject all accesses to the flash with -EBUSY.
> 
> Sequence for the erase worker function:
> 
>     systemctl reboot -ff           ubi_thread
> 
>                                  do_work
>   __do_sys_reboot
>     blocking_notifier_call_chain
>       mtd_reboot_notifier
>         nand_shutdown
>           nand_suspend
>                                    __erase_worker
>                                      ubi_sync_erase
>                                        mtd_erase
>                                          nand_erase_nand
> 
>                                            # Blocked by suspended chip
>                                            nand_get_device
>                                              => EBUSY
> 
> Similar sequence for the wear-leveling function:
> 
>     systemctl reboot -ff           ubi_thread
> 
>                                  do_work
>   __do_sys_reboot
>     blocking_notifier_call_chain
>       mtd_reboot_notifier
>         nand_shutdown
>           nand_suspend
>                                    wear_leveling_worker
>                                      ubi_eba_copy_leb
>                                        ubi_io_write
>                                          mtd_write
>                                            nand_write_oob
> 
>                                              # Blocked by suspended chip
>                                              nand_get_device
>                                                => EBUSY
> 
>   systemd-shutdown[1]: Rebooting.
>   ubi0 error: ubi_io_write: error -16 while writing 2048 bytes to PEB
>   CPU: 1 PID: 82 Comm: ubi_bgt0d Kdump: loaded Tainted: G           O
>   (unwind_backtrace) from [<80107b9f>] (show_stack+0xb/0xc)
>   (show_stack) from [<8033641f>] (dump_stack_lvl+0x2b/0x34)
>   (dump_stack_lvl) from [<803b7f3f>] (ubi_io_write+0x3ab/0x4a8)
>   (ubi_io_write) from [<803b817d>] (ubi_io_write_vid_hdr+0x71/0xb4)
>   (ubi_io_write_vid_hdr) from [<803b6971>] (ubi_eba_copy_leb+0x195/0x2f0)
>   (ubi_eba_copy_leb) from [<803b939b>] (wear_leveling_worker+0x2ff/0x738)
>   (wear_leveling_worker) from [<803b86ef>] (do_work+0x5b/0xb0)
>   (do_work) from [<803b9ee1>] (ubi_thread+0xb1/0x11c)
>   (ubi_thread) from [<8012c113>] (kthread+0x11b/0x134)
>   (kthread) from [<80100139>] (ret_from_fork+0x11/0x38)
>   Exception stack(0x80c43fb0 to 0x80c43ff8)
>   ...
>   ubi0 error: ubi_dump_flash: err -16 while reading 2048 bytes from PEB
>   ubi0 error: wear_leveling_worker: error -16 while moving PEB 246 to PEB
>   ubi0 warning: ubi_ro_mode.part.0: switch to read-only mode
>   ...
>   ubi0 error: do_work: work failed with error code -16
>   ubi0 error: ubi_thread: ubi_bgt0d: work failed with error code -16

Yes, I also saw messages like these on kernels before v5.18. Commit
013e6292aaf5e4b0 ("mtd: rawnand: Simplify the locking") changed the
behavior of nand_get_device(): a process that calls nand_get_device()
during reboot no longer gets stuck. It gets an EBUSY error code instead,
which is why we see the above messages from the UBI module.
Commit 8cba323437a49a4 ("mtd: rawnand: protect access to rawnand
devices while in suspend") changed the behavior of nand_get_device()
back: a process that calls nand_get_device() during reboot gets stuck
again, so there should be no error messages in the UBI layer.
So, is your kernel version lower than v5.18?
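
For illustration only, here is a minimal userspace sketch of the two
nand_get_device() behaviors described above. It is not the kernel code:
struct model_chip and all model_* functions are hypothetical names, and a
pthread mutex and condition variable stand in for the kernel's locking.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a NAND chip (not struct nand_chip). */
struct model_chip {
	pthread_mutex_t lock;
	pthread_cond_t resumed;
	bool suspended;
};

/* Fail-fast variant: a caller hitting a suspended chip gets -EBUSY. */
static int model_get_device_ebusy(struct model_chip *chip)
{
	pthread_mutex_lock(&chip->lock);
	if (chip->suspended) {
		pthread_mutex_unlock(&chip->lock);
		return -EBUSY;
	}
	return 0; /* success: caller now holds chip->lock */
}

/* Blocking variant: a caller hitting a suspended chip sleeps until resume. */
static int model_get_device_wait(struct model_chip *chip)
{
	pthread_mutex_lock(&chip->lock);
	while (chip->suspended)
		pthread_cond_wait(&chip->resumed, &chip->lock);
	return 0; /* success: caller now holds chip->lock */
}

/* Drop the lock taken by a successful model_get_device_*() call. */
static void model_release_device(struct model_chip *chip)
{
	pthread_mutex_unlock(&chip->lock);
}

/* Mark the chip suspended, as the reboot notifier path would. */
static void model_suspend(struct model_chip *chip)
{
	pthread_mutex_lock(&chip->lock);
	chip->suspended = true;
	pthread_mutex_unlock(&chip->lock);
}

/* Clear the suspended flag and wake any blocked callers. */
static void model_resume(struct model_chip *chip)
{
	pthread_mutex_lock(&chip->lock);
	chip->suspended = false;
	pthread_cond_broadcast(&chip->resumed);
	pthread_mutex_unlock(&chip->lock);
}
```

With the fail-fast variant, a UBI worker that runs after the suspend sees
-EBUSY (error -16, as in the log above). With the blocking variant, the
worker just sleeps, and on a reboot path where the chip is never resumed it
stays asleep, so no error is reported.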