Numonyx NOR and chip->mutex bug?

Wed Feb 9 14:52:23 EST 2011

On Feb 7, 2011, at 2:04 PM, Michael Cashwell wrote:

> To avoid masking the failure I think I need to collect information into a static buffer and printk the buffered data only on the error path. That way, the timing of the code hopefully won't change enough to hide the problem. That's more complicated to do, of course.
> 
> More as I find it.

I added a simple buffer to log the last N map_writes() in cfi_cmdset_0001.c. As expected, I had to be very selective about what and how much to log and the logging code itself could take enough on the hot path to make my failure vanish. But I have been able to distill down the results.

It doesn't look like a threading/locking/mutex problem after all. It looks like a straight timing issue on these chips. Here's my reduced log:

Dumping flash write history:
... ongoing write buffer data ...
...cfi_cmdset_0001.c line 1793 writes 216c  <     last data word
...cfi_cmdset_0001.c line 1808 writes 00d0  <---  write confirm
...cfi_cmdset_0001.c line 1051 writes 0050  +	  presumably the buffer write finished and called put_chip.
...cfi_cmdset_0001.c line 1052 writes 00d0  + <-- put_chip(), oldstate FL_ERASING -> erase resume.
...cfi_cmdset_0001.c line 1053 writes 0070  +     == 20µs delay or discard-read of status register here hides problem ===
...cfi_cmdset_0001.c line 832 writes 00b0   <---| erase suspend.
...cfi_cmdset_0001.c line 839 writes 0070   <---| presumably in get_chip/chip_ready for FL_WRITING at start of
...cfi_cmdset_0001.c line 1743 writes 00e8  <---  buffer write
...cfi_cmdset_0001.c line 1771 writes 01ff  <---  number of words
...cfi_cmdset_0001.c line 1793 writes 0001  <     data word 1
... 510 words of data ...
...cfi_cmdset_0001.c line 1793 writes 5c0c  <     data word 512
...cfi_cmdset_0001.c line 1808 writes 00d0  <---  write confirm
...cfi_cmdset_0001.c line 1829 writes 0050  +     write error path
...cfi_cmdset_0001.c line 1830 writes 0070  +     write error has been sensed
NOR Flash: buffer write error (block locked) status 0xffff (priv = (null), chip = c7cf2ef8)
UBI error: ubi_io_write: error -30 while writing 1960 bytes to PEB 30:57800, written 568 bytes
UBI warning: ubi_eba_write_leb: failed to write data to PEB 30
UBI warning: ubi_ro_mode: switch to read-only mode
UBIFS error (pid 404): ubifs_wbuf_write_nolock: cannot write 1956 bytes to LEB 17:57664, error -30
UBIFS warning (pid 404): ubifs_ro_mode: switched to read-only mode, error -30

So when a buffered write fails, it's when the previous write suspended an erase. But more specifically, it seems that if the suspended erase is resumed and then immediately suspended again things go off the rails. This is between entries for lines 1053 and 832 above.

The chip seems to malfunction if an erase-resume is followed "too soon" by an erase-suspend.

I'd like to do things like sys_gettid() or perhaps get_timeofday() but I find that if I log more than 2 words per map_write() the failure is masked. Still working on that.

The only other thing I've done is ensure that following an erase-suspend for (;;) loop that map_word_bitset(map, status, CMD(0x40)) is true. If not then the erase that was being suspended completed instead (which can happen according to the data sheet). That assertion has not failed in my testing so that's not my problem.

I'm still digging but it's really looking like a straight timing problem with this batch of chips. At what point should I declare defeat and propose my simple fix even though it's just for these chips? The file has several instances of similar "Parts XYZ act funny without this even though it should be unnecessary" workarounds.

-Mike