[bug report] mtd: bad block counter inflated when repeatedly marking the same block

Mon Sep 1 02:26:45 PDT 2025

Hi all,

I’d like to report a mismatch between bad-block statistics and actual
on-flash state when repeatedly calling MEMSETBADBLOCK on the same
eraseblock.

Summary
- Repeatedly marking the same block bad (e.g., 5 times) makes
   /sys/class/mtd/mtdX/bad_blocks increase by 5.
- After a reboot, the counter returns to the correct value of 1.
- So the runtime counter (ecc_stats.badblocks) is inflated.

Repro (with nandsim.ko)

```bash
# ID="0xec,0xa1,0x00,0x15" # 128M 128KB 2KB
# modprobe nandsim id_bytes=$ID
# ~/mtd-utils/mtd_markbad /dev/mtd1 10 1 # Repeat 5 times
......
# ~/mtd-utils/mtd_markbad /dev/mtd1 10 1

# -- The sysfs counter now reports 5 bad blocks.
# cat /sys/class/mtd/mtd1/bad_blocks
5

# -- A scan of the flash finds only 1 bad block.
# ubiformat -v /dev/mtd1  | grep "bad eraseblock"
ubiformat: 1 bad eraseblocks found, numbers: 10
```

Root cause analysis (kernel-side)

```
mtd_block_markbad
   mtd->_block_markbad()
     nand_block_markbad
       ret = nand_block_isbad
       if (ret > 0) return 0;   // block is already bad: nothing is marked, but 0 is returned
   mtd->ecc_stats.badblocks++;  // counted even though no new bad block was marked
```

Relevant code
- drivers/mtd/nand/raw/nand_base.c:nand_block_markbad()
- drivers/mtd/mtdcore.c:mtd_block_markbad()

nand_block_markbad() returns 0 both for “newly marked” and “already bad”,
so mtdcore cannot tell whether the call actually added a new bad block.
It increments ecc_stats.badblocks on every successful return anyway.
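
To make the accounting problem concrete, here is a small userspace model of
the behaviour described above. It is only a sketch, not kernel code: struct
sim_mtd and all sim_* functions are hypothetical stand-ins that I made up for
illustration, and the block count is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_NBLOCKS 64	/* arbitrary size for the model */

/* Hypothetical stand-in for an MTD device; not a kernel structure. */
struct sim_mtd {
	bool bad[SIM_NBLOCKS];	/* models the on-flash bad-block markers */
	unsigned int badblocks;	/* models the runtime ecc_stats.badblocks */
};

/* Models nand_block_markbad(): returns 0 for "newly marked" and "already bad". */
int sim_driver_markbad(struct sim_mtd *m, unsigned int blk)
{
	assert(blk < SIM_NBLOCKS);
	if (m->bad[blk])
		return 0;	/* already bad: nothing written, still success */
	m->bad[blk] = true;
	return 0;
}

/* Models mtd_block_markbad(): counts every successful driver call. */
int sim_core_markbad(struct sim_mtd *m, unsigned int blk)
{
	int ret = sim_driver_markbad(m, blk);

	if (ret)
		return ret;
	m->badblocks++;		/* incremented even if the block was already bad */
	return 0;
}

/* Models rescanning the on-flash markers, e.g. after a reboot. */
unsigned int sim_scan_badblocks(const struct sim_mtd *m)
{
	unsigned int i, n = 0;

	for (i = 0; i < SIM_NBLOCKS; i++)
		n += m->bad[i];
	return n;
}
```

Calling sim_core_markbad() five times on block 10 of a zero-initialised
struct sim_mtd leaves badblocks at 5, while sim_scan_badblocks() returns 1,
which matches the nandsim observation.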

Possible fixes (high level)
- Core-side conservative fix (minimal ABI change):
   * In mtd_block_markbad(), probe _block_isbad(master, ofs) before
     calling _block_markbad(), and (if available) probe again after success.
   * Only increment ecc_stats.badblocks if the state transitioned from
     “good” to “bad”.

- Teach *_block_markbad() to return a distinct positive code for
   “already bad” vs “newly marked”, so the core can increment only on
   “newly marked”.
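
Both ideas can be sketched in the same kind of userspace model. Again, this
is an illustration under stated assumptions and not a proposed patch: struct
toy_mtd, the toy_* functions and the TOY_ALREADY_BAD value are hypothetical
names, and the real code would have to deal with partitions, locking and
drivers without _block_isbad.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NBLOCKS 64		/* arbitrary size for the model */
#define TOY_ALREADY_BAD 1	/* hypothetical positive code for the second idea */

/* Hypothetical stand-in for an MTD device; not a kernel structure. */
struct toy_mtd {
	bool bad[TOY_NBLOCKS];	/* models the on-flash bad-block markers */
	unsigned int badblocks;	/* models the runtime ecc_stats.badblocks */
};

/* Models _block_isbad(): 1 if bad, 0 if good (a real driver may return < 0). */
int toy_isbad(const struct toy_mtd *m, unsigned int blk)
{
	assert(blk < TOY_NBLOCKS);
	return m->bad[blk] ? 1 : 0;
}

/* Models the current driver behaviour: 0 whether or not the block was bad. */
int toy_driver_markbad(struct toy_mtd *m, unsigned int blk)
{
	assert(blk < TOY_NBLOCKS);
	m->bad[blk] = true;
	return 0;
}

/* Idea 1: the core probes before and after, and counts only good -> bad. */
int toy_core_markbad_probe(struct toy_mtd *m, unsigned int blk)
{
	int was_bad = toy_isbad(m, blk);
	int ret;

	if (was_bad < 0)	/* propagate a probe error; never hit in this model */
		return was_bad;
	ret = toy_driver_markbad(m, blk);
	if (ret)
		return ret;
	if (!was_bad && toy_isbad(m, blk) > 0)
		m->badblocks++;
	return 0;
}

/* Idea 2: the driver returns a distinct positive code for "already bad". */
int toy_driver_markbad_rc(struct toy_mtd *m, unsigned int blk)
{
	assert(blk < TOY_NBLOCKS);
	if (m->bad[blk])
		return TOY_ALREADY_BAD;
	m->bad[blk] = true;
	return 0;
}

/* The core counts only on 0 and hides the positive code from its callers. */
int toy_core_markbad_rc(struct toy_mtd *m, unsigned int blk)
{
	int ret = toy_driver_markbad_rc(m, blk);

	if (ret < 0)
		return ret;
	if (ret == 0)
		m->badblocks++;
	return 0;
}
```

In this model, five calls to either toy_core_markbad_probe() or
toy_core_markbad_rc() on the same block leave badblocks at 1. The probe
variant costs up to two extra isbad reads per call but leaves drivers
untouched; the return-code variant avoids the extra reads but needs every
*_block_markbad() implementation to agree on the new code.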

Questions
- Would the core-side pre/post _block_isbad check be acceptable as a short-term fix?
- Any objections regarding the extra isbad IO in the markbad path?
- Longer-term, is there interest in an explicit API/return-code semantics
   to differentiate “already bad” vs “newly marked”?

I’m very interested in helping resolve this issue and would be grateful
for any guidance or suggestions.

Best regards,
Wang Zhaolong