ubifs: master area fails to recover when master node 1 is corrupted

Thu Jan 25 18:20:18 PST 2024

On 2024/1/25 19:48, Ryder Wang wrote:
> Hi,
> 
> I just found that the master area always fails to recover during mount when master node 1's CRC is corrupted but master node 2 is completely good. It is 100% reproducible on kernel v5.4.233, although it seems to be a general issue.
> 

According to the debug messages below, the mounting failure occurs as 
follows:
                     LEB 1                       LEB 2
           |mst1 | 0xFF 0xFF ... |      |mst2 | 0xFF 0xFF ... |
offset    0                            0
* mst1 has a bad CRC.
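To illustrate what "bad CRC" means here: every UBIFS node starts with a common header whose crc field (byte offset 4, right after the magic) covers the node from byte offset 8 to its end, computed with the kernel's crc32() seeded with UBIFS_CRC32_INIT (0xFFFFFFFF). The C sketch below reproduces only that CRC comparison. It is not the kernel code: the real ubifs_check_node() also validates the magic, node type and length, and the helper names (crc32_le_sketch, node_seal, node_crc_ok) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UBIFS_CRC32_INIT 0xFFFFFFFFU
#define CH_CRC_OFFS  4  /* crc field follows the 4-byte magic in the common header */
#define CH_CRC_START 8  /* the CRC covers everything from offset 8 to the node's end */

/* Bitwise CRC-32, reflected polynomial 0xEDB88320, no final inversion
 * (the same convention as the kernel's crc32_le()). */
static uint32_t crc32_le_sketch(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0xEDB88320U & (0U - (crc & 1U)));
	}
	return crc;
}

static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static void put_le32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)v;
	p[1] = (uint8_t)(v >> 8);
	p[2] = (uint8_t)(v >> 16);
	p[3] = (uint8_t)(v >> 24);
}

/* Compute the node CRC and store it little-endian in the header. */
static void node_seal(uint8_t *node, size_t len)
{
	assert(len > CH_CRC_START);
	put_le32(node + CH_CRC_OFFS,
		 crc32_le_sketch(UBIFS_CRC32_INIT, node + CH_CRC_START,
				 len - CH_CRC_START));
}

/* Return true if the stored CRC matches the recomputed one. */
static bool node_crc_ok(const uint8_t *node, size_t len)
{
	assert(len > CH_CRC_START);
	return get_le32(node + CH_CRC_OFFS) ==
	       crc32_le_sketch(UBIFS_CRC32_INIT, node + CH_CRC_START,
			       len - CH_CRC_START);
}
```

Flipping any bit of the stored crc field, as in step 1 of the reproducer, makes node_crc_ok() return false. That mismatch is the condition ubifs_check_node() reports as -EUCLEAN in the trace.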

ubifs_recover_master_node
  get_master_node(UBIFS_MST_LNUM, &mst1)
   ubifs_scan_a_node(buf, lnum, offs=0) // SCANNED_A_CORRUPT_NODE
    ubifs_check_node  // -EUCLEAN, caused by bad crc
   if (offs < c->leb_size) // true
    if (!is_empty(buf, min_t(int, len, sz))) // true
     dbg_rcvry("found corruption at %d:%d")
  get_master_node(UBIFS_MST_LNUM + 1, &buf2, &mst2)
   ubifs_scan_a_node // SCANNED_A_NODE
   *mst = buf // buf = sbuf
   buf2 = sbuf
  if (mst1) // false
  else {
   offs2 = (void *)mst2 - buf2;  // offs2 = 0
   if (offs2 + sz + sz <= c->leb_size) // true, mst2 is the first node in LEB 2
     goto out_err
  }
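The check in the else branch can be written as a standalone predicate. This is a hedged sketch, not the kernel function: mst2_only_acceptable() is a hypothetical name, and it models only the case where get_master_node() found no valid node in LEB 1. Its parameters stand in for mst2 != NULL, offs2, sz (c->mst_node_alsz, the aligned master node size) and c->leb_size.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of the "mst1 not found" branch of ubifs_recover_master_node().
 * mst2_found: whether a valid master node was found in LEB 2.
 * offs2:      offset of that node in LEB 2 ((void *)mst2 - buf2).
 * sz:         aligned master node size.
 * leb_size:   LEB size.
 * Returns true if mst2 may be used, false for the "goto out_err" path.
 */
static bool mst2_only_acceptable(bool mst2_found, int offs2, int sz,
				 int leb_size)
{
	assert(sz > 0 && leb_size > 0);
	if (!mst2_found)
		return false;	/* neither copy is usable */
	/*
	 * LEB 1 is only unmapped once the master area is full, so after a
	 * power cut LEB 2 must have no room left for another master node.
	 */
	if (offs2 + sz + sz <= leb_size)
		return false;	/* room left: not a power-cut state */
	return true;
}
```

For the reported case (offs2 = 0, leb_size = 253952 as implied by "253952 bytes left" at LEB 1:0 in the log, and an assumed sz of 2048) the predicate returns false, matching the goto out_err path.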

The process above corresponds to one of the power-cut scenarios that 
master node recovery handles: LEB 1 has been unmapped so that the newest 
master node can be written to it, and the power cut happens before that 
write:
ubifs_write_master
  lnum = UBIFS_MST_LNUM; // LEB 1
  if (offs + UBIFS_MST_NODE_SZ > c->leb_size) // true
   err = ubifs_leb_unmap(c, lnum);
  >> powercut <<
  err = ubifs_write_node_hmac(c->mst_node, lnum)
So the master node in LEB 2 can only be recovered if there is no room 
left for new master nodes in LEB 2.
In your test, mst1 was corrupted deliberately to construct this 
situation. However, mst2 is the first node in LEB 2 (offs2 = 0), so LEB 2 
still has room for more master nodes. UBIFS recognizes that this is not 
the expected power-cut state and refuses to recover the master node.
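The power-cut scenario described above can be modelled with a small C sketch. It is a simplified, hypothetical model of ubifs_write_master(), not the kernel code: it only tracks offsets, MST_ALSZ = 2048 is an ASSUMED aligned master node size (the log does not show the min. I/O size), and model_write_master() and model_fill_until_cut() are made-up names. It shows that a power cut right after LEB 1 is unmapped leaves LEB 1 empty and the newest LEB 2 node in the last slot.

```c
#include <assert.h>
#include <stdbool.h>

#define MST_NODE_SZ 512    /* master node length; "len 512" in the log */
#define MST_ALSZ    2048   /* ASSUMPTION: node size aligned to 2048-byte min. I/O */
#define LEB_SIZE    253952 /* from "look at LEB 1:0 (253952 bytes left)" */

/* Offset of the newest master node in each LEB, or -1 if unmapped. */
struct mst_area {
	int last[2];	/* last[0]: LEB 1, last[1]: LEB 2 */
	int mst_offs;	/* offset of the newest master node (shared by both copies) */
};

/*
 * Simplified model of ubifs_write_master(). If power_cut is true, the
 * write stops right after LEB 1 is unmapped, as in the trace above.
 * Returns false when the simulated power cut hits.
 */
static bool model_write_master(struct mst_area *a, bool power_cut)
{
	int offs = a->mst_offs + MST_ALSZ;

	if (offs + MST_NODE_SZ > LEB_SIZE) {
		offs = 0;
		a->last[0] = -1;	/* unmap LEB 1 */
		if (power_cut)
			return false;	/* >> powercut << */
	}
	a->mst_offs = offs;
	a->last[0] = offs;		/* write copy 1 to LEB 1 */
	if (offs == 0)
		a->last[1] = -1;	/* LEB 2 is unmapped the same way */
	a->last[1] = offs;		/* write copy 2 to LEB 2 */
	return true;
}

/* Start from a fresh area (one node at offset 0 in each LEB) and keep
 * writing master nodes until the power cut hits the LEB 1 unmap. */
static struct mst_area model_fill_until_cut(void)
{
	struct mst_area a = { { 0, 0 }, 0 };

	while (model_write_master(&a, true))
		;
	assert(a.last[0] == -1);
	return a;
}
```

Under these assumptions, model_fill_until_cut() ends with LEB 1 empty and the newest LEB 2 node at offset 251904 (253952 - 2048), leaving no room for another node. That is the only state in which recovery accepts mst2 alone, unlike the reported one where mst2 sits at offset 0.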

> How to reproduce it:
> 1. Corrupt the CRC value of master node 1 (keeping master node 2 good) on UBIFS.
> 2. Mount this ubifs.
> 
> The mount at step #2 always fails. From the log, it looks like master node recovery fails, but recovery is expected to succeed in such a case.

The master node is not expected to be recoverable in this situation. The 
two master node copies are not meant for recovery from arbitrary 
corruption; they are used to determine which version of the master node 
is valid. See the following section in [1]:

"The master node stores the position of all on-flash structures ... The 
first is that there could be a loss of power at the same instant that 
the master node is being written. The second is that there could be 
degradation or corruption of the flash media itself. ... In the second 
case, recovery is not possible because it cannot be determined reliably 
what is a valid master node version."

[1] http://linux-mtd.infradead.org/doc/ubifs_whitepaper.pdf

> 
> Below is the kernel log of this failure:
> 
> ubifs_mount:2253: UBIFS DBG gen (pid 10770): name ubi0:test_volume, flags 0x0
> ubifs_mount:2274: UBIFS DBG gen (pid 10770): opened ubi0_0
> ubifs_read_node:1094: UBIFS DBG io (pid 10770): LEB 0:0, superblock node, length 4096
> UBIFS (ubi0:0): Mounting in unauthenticated mode
> ubifs_read_superblock:765: UBIFS DBG mnt (pid 10770): Auto resizing from 13 LEBs to 100 LEBs
> ubifs_start_scan:131: UBIFS DBG scan (pid 10770): scan LEB 1:0
> ubifs_scan:270: UBIFS DBG scan (pid 10770): look at LEB 1:0 (253952 bytes left)
> ubifs_scan_a_node:77: UBIFS DBG scan (pid 10770): scanning master node at LEB 1:0
> UBIFS error (ubi0:0 pid 10770): ubifs_scan [ubifs]: bad node
> ubifs_recover_master_node:234: UBIFS DBG rcvry (pid 10770): recovery
> ubifs_scan_a_node:77: UBIFS DBG scan (pid 10770): scanning master node at LEB 1:0
> get_master_node:163: UBIFS DBG rcvry (pid 10770): found corruption at 1:0
> ubifs_scan_a_node:77: UBIFS DBG scan (pid 10770): scanning master node at LEB 2:0
> get_master_node:152: UBIFS DBG rcvry (pid 10770): found a master node at 2:0
> UBIFS error (ubi0:0 pid 10770): ubifs_recover_master_node [ubifs]: failed to recover master node
> UBIFS error (ubi0:0 pid 10770): ubifs_recover_master_node [ubifs]: dumping second master node
> UBIFS (ubi0:0): background thread "ubifs_bgt0_0" started, PID 10772
>          magic          0x6101831
>          crc            0x3a5c03b2
>          node_type      7 (master node)
>          group_type     0 (no node group)
>          sqnum          9
>          len            512
>          highest_inum   65
>          commit number  0
>          flags          0x2
>          log_lnum       3
>          root_lnum      12
>          root_offs      0
>          root_len       108
>          gc_lnum        11
>          ihead_lnum     12
>          ihead_offs     4096
>          index_size     112
>          lpt_lnum       7
>          lpt_offs       44
>          nhead_lnum     7
>          nhead_offs     4096
>          ltab_lnum      7
>          ltab_offs      57
>          lsave_lnum     0
>          lsave_offs     0
>          lscan_lnum     10
>          leb_cnt        13
>          empty_lebs     1
>          idx_lebs       1
>          total_free     753664
>          total_dirty    7640
>          total_used     440
>          total_dead     0
>          total_dark     16384
> UBIFS (ubi0:0): background thread "ubifs_bgt0_0" stops
> 
> ______________________________________________________
> Linux MTD discussion mailing list
> http://lists.infradead.org/mailman/listinfo/linux-mtd/