[PATCH] mtd: cfi: Wait for Block Erase operation to finish

Wed Feb 29 12:22:39 EST 2012

--- On Wed, 29/2/12, Joakim Tjernlund <joakim.tjernlund at transmode.se> wrote:
> If erase is suspended, chip->state is changed, which will keep the
> function in the for(;;) loop. Once erase has been resumed again,
> chip->state will change back.
> It seems to me that chip->state and SR.6 are mutually exclusive? I
> cannot see how you can get to
>     status = map_read(map, cmd_adr);
>     if (map_word_andequal(map, status, status_OK, status_OK))
> if the erase has been suspended.

OK, I understand. Let me demonstrate.

I added the following debug to the original linux-3.3-rc5
drivers/mtd/chips/cfi_cmdset_0001.c file (i.e. without my patch):

--------

797a798
> printk("%d: CRDY: about to Erase Suspend adr=0x%08x\n", current->pid, adr);
958a960
> printk("%d: PUTC: enter\n", current->pid);
974a977
> printk("%d: PUTC: about to wake_up (a)\n", current->pid);
988a992
> printk("%d: PUTC: about to wake_up (b)\n", current->pid);
1005a1010
> printk("%d: PUTC: about to Erase Resume adr=0x%08x\n", current->pid, adr);
1025a1031
> printk("%d: PUTC: about to wake_up (c)\n", current->pid);
1216a1223
> printk("%d: WAIT: call-state=%d, cmd_adr=0x%08x\n", current->pid, chip_state, cmd_adr);
1234a1242
> printk("%d: WAIT: about to suspend\n", current->pid);
1235a1244
> printk("%d: WAIT: resumed\n", current->pid);
1241a1251
> printk("%d: WAIT: status=0x%08x\n", current->pid, status.x[0]);
1257a1268
> printk("%d: WAIT: return -ETIME\n", current->pid);
1262a1274
> printk("%d: WAIT: about to sleep: sleep_time=%d\n", current->pid, sleep_time);
1281a1294
> printk("%d: WAIT: return 0\n", current->pid);
1891a1905
> printk("%d: ERAS: adr=0x%08x\n", current->pid, adr);

--------

I rebooted my hx4700, saved the dmesg output, and edited and annotated
what I feel to be the important messages:

--------

[    2.224667] 30: WAIT: status=0x00a000a0
[    2.224693] 30: WAIT: return 0

PID 30 has just failed FL_ERASING of block 0x00440000 for the 3rd time:
erase has failed (SR.5 = 1).

[    2.224719] block erase failed at 0x00440000: status 0xa000a0. Retrying...

Confirmation of erase fail from elsewhere.

[    2.224779] 30: ERAS: adr=0x00440000
[    2.224798] 30: WAIT: call-state=4, cmd_adr=0x00440000
[    2.224814] 30: WAIT: status=0x00000000
[    2.224830] 30: WAIT: about to sleep: sleep_time=512000

PID 30 is now attempting FL_ERASING of block 0x00440000 for the 4th time.

[    2.492129] 76: CRDY: about to Erase Suspend adr=0x0060f748
[    2.492216] 76: PUTC: enter
[    2.492233] 76: PUTC: about to Erase Resume adr=0x0060f748

PID 76 has issued Erase Suspend and Erase Resume (the address is irrelevant).
PID 30 is still sleeping at this point.

[    2.785235] 30: WAIT: status=0x00e800e8
[    2.785250] 30: WAIT: return 0

PID 30 has woken up to find:
bad F-VPP (SR.3 = 1),
erase has failed (SR.5 = 1),
Erase Suspend is in effect (SR.6 = 1).

[    2.785276] physmap-flash: block erase error: (bad VPP)

Confirmation of bad VPP from elsewhere.

[    2.863384] 30: ERAS: adr=0x00440000
[    2.863412] 30: WAIT: call-state=4, cmd_adr=0x00440000
[    2.863431] 30: WAIT: status=0x00c000c0
[    2.863444] 30: WAIT: return 0

PID 30 is now attempting FL_ERASING of block 0x00440000 for the 5th time:
Erase Suspend is in effect (SR.6 = 1),
but no errors, so the erase seemingly succeeded.

[    2.863543] 30: WAIT: call-state=8, cmd_adr=0x00440000
[    2.863562] 30: WAIT: status=0x00c000c0
[    2.863574] 30: WAIT: return 0

PID 30 is now attempting FL_WRITING_TO_BUFFER:
Erase Suspend is in effect (SR.6 = 1),
but no errors, so the FL_WRITING_TO_BUFFER seemingly succeeded.

[    2.863599] 30: WAIT: call-state=7, cmd_adr=0x00440000
[    2.863615] 30: WAIT: status=0x00d000d0
[    2.863628] 30: WAIT: return 0

PID 30 is now attempting FL_WRITING to block 0x00440000:
Erase Suspend is in effect (SR.6 = 1),
program has failed (SR.4 = 1).

[    2.863649] physmap-flash: buffer write error (status 0xd000d0)

Confirmation of write error from elsewhere.

[    2.863703] UBI error: ubi_io_write: error -22 while writing 64 bytes to PEB 262:0, written 0 bytes
[    2.863728] UBI error: erase_worker: failed to erase PEB 262, error -22
[    2.863750] UBI warning: ubi_ro_mode: switch to read-only mode
[    2.863770] UBI error: do_work: work failed with error code -22
[    2.863791] UBI error: ubi_thread: ubi_bgt0d: work failed with error code -22

Bye bye UBI.

--------

First we note that the WAIT (inval_cache_and_wait_for_operation) function
never suspends itself; i.e. it never prints "WAIT: about to suspend".

The message at:
[    2.863431] 30: WAIT: status=0x00c000c0
shows that WAIT found status_OK (SR.7) set and Erase Suspend in effect.
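
For anyone following the status words in the log, here is a minimal sketch
of how they decode. It is not driver code: sr_describe() is a hypothetical
helper, and it assumes that the 32-bit word comes from two interleaved
chips reporting the same status (as every word in the log above does), so
only the low byte is examined. The bit meanings are those of the
Intel/Sharp command set status register.

```c
#include <stdint.h>
#include <stdio.h>

/* Status register bits of the Intel/Sharp command set (0001). */
#define SR_READY         0x80u /* SR.7: 1 = ready, 0 = busy */
#define SR_ERASE_SUSPEND 0x40u /* SR.6: erase is suspended */
#define SR_ERASE_ERROR   0x20u /* SR.5: erase failed */
#define SR_PROGRAM_ERROR 0x10u /* SR.4: program failed */
#define SR_VPP_ERROR     0x08u /* SR.3: bad VPP */

/*
 * Hypothetical helper, not part of the MTD driver.
 * Assumption: both interleaved chips report the same status, so decoding
 * the low byte of the word is enough.
 */
static void sr_describe(uint32_t word, char *buf, size_t len)
{
	unsigned int sr = word & 0xffu;

	snprintf(buf, len, "%s%s%s%s%s",
		 (sr & SR_READY) ? "ready" : "busy",
		 (sr & SR_ERASE_SUSPEND) ? " erase-suspended" : "",
		 (sr & SR_ERASE_ERROR) ? " erase-error" : "",
		 (sr & SR_PROGRAM_ERROR) ? " program-error" : "",
		 (sr & SR_VPP_ERROR) ? " vpp-error" : "");
}
```

Calling sr_describe(0x00e800e8, buf, sizeof(buf)) fills buf with
"ready erase-suspended erase-error vpp-error", which matches the
annotation at 2.785235.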

Now things get complicated.

I repeated the exercise with my patch applied. This time the WAIT
function looped forever trying to erase the same block (0x00440000)
because the status never changed from 0x00e800e8 (rather like 2.785235
above, but with my patch now applied the function cannot return).

Disheartened, I thought my patch had only succeeded in livelocking the
function because of a hardware error.

However, remembering our version of the Heisenberg uncertainty principle -
that adding debug sometimes changes the behaviour - I stripped out almost
all of the debug and repeated the exercise. This time block 0x00440000
erased successfully without errors.

Where do we go from here?

The above debug demonstrates that inval_cache_and_wait_for_operation()
can return while Erase Suspend is in effect. My patch prevents that, and
fixes UBI for me.
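
To make the difference concrete, here is a rough user-space sketch of the
two completion tests. It is not the patch itself: both function names are
made up for illustration, and the same assumption applies that the two
interleaved chips report identical status so the low byte is enough. The
first function mirrors a test on SR.7 alone, which is what the
map_word_andequal() check against status_OK amounts to; the second also
insists that SR.6 is clear.

```c
#include <stdbool.h>
#include <stdint.h>

#define SR_READY         0x80u /* SR.7: 1 = ready, 0 = busy */
#define SR_ERASE_SUSPEND 0x40u /* SR.6: erase is suspended */

/* Hypothetical: completion test that looks at SR.7 only. */
static bool ready_bit_only(uint32_t word)
{
	return ((word & 0xffu) & SR_READY) != 0;
}

/*
 * Hypothetical: completion test for an erase that refuses to report
 * "done" while the chip still says the erase is suspended.
 */
static bool erase_finished(uint32_t word)
{
	unsigned int sr = word & 0xffu;

	return (sr & SR_READY) != 0 && (sr & SR_ERASE_SUSPEND) == 0;
}
```

With the status 0x00c000c0 seen at 2.863431, ready_bit_only() returns true
while erase_finished() returns false, so a wait loop built on the second
test would keep polling instead of returning.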

I am open to the suggestion that my hx4700 exhibits random hardware
failures. This would explain the results. But I have a nagging doubt about
that because my bootloader has never reported erase errors (apart from once
when I broke its erase block size table). As I mentioned previously, the
bootloader never suspends erase operations.

I am open to the suggestion that my patch merely fixes a symptom rather
than an underlying cause.

So perhaps the MTD driver contains a race condition. The fact that
stripping out the debug made the erase error disappear tends to support
that explanation.

I don't know what else to say.

Paul