Numonyx NOR and chip->mutex bug?

Wed Feb 2 11:20:46 EST 2011

On Jan 25, 2011, at 6:09 PM, Joakim Tjernlund wrote:

> On Jan 25, 2011, at 1:14 PM, Michael Cashwell wrote:
> 
>> With this new part I'm seeing MTD errors that I think I've traced to cfi_cmdset_0001.c that I'd like to ask about.
>> 
>> The error manifests when I write hard to a UBIFS file system on this NOR flash. What I see is a "NOR Flash: buffer write error" and then either "(block locked)" or "(Bad VPP)"
> 
> Just check that you didn't get some old samples. We did.

All I know is that the chips are marked with an '08 copyright. (The previous ones are '03.) So we are dealing with old(er) parts in any event. But from what I can tell, this particular part is the current one Numonyx/Micron are selling.

>> The fact that the errors stop if I comment out the chip->mutex calls while waiting [for command completion] suggests to me that there's a reentrancy problem. It doesn't mean the locks are wrong or that doing that is a real fix.
> 
> Oh, I misread earlier. I figured you held the lock for all ops.

In the end I found a failure in the following scenario. A block erase is underway and a request is made to access the chip in order to write data elsewhere. The erase is suspended and the buffered write is performed. When the chip is released after the write operation the code notices the suspended erase and resumes it. But there seems to be a timing issue: the WSM ready bit (SR.7) is checked "too soon" after the resume command is issued, which makes the code think the erase is complete when it is not.

The normal code paths that *start* erase or program operations have an inherent delay of several µs between writing the command and the first read of the WSM status. This delay is a side effect of a kernel cache invalidate call. But the key issue is that when resuming an erase no such cache invalidation is done as it was already done when the erase originally began.

That difference means there's very little time between the resume command write and the status read. The apparent result is that the WSM is reported "not busy" when in fact the resumption is still being acted upon. The code misinterprets this to mean the resumed erase is complete when it is not and subsequent commands then go fully off the rails as a result.

I cannot find a corresponding timing constraint in the data sheet. By rights, the bus cycle time alone should be enough between the write and read. But in practice, for these parts, it is not. This may be an undocumented erratum for current parts or just an anomaly for this batch. I have no way to tell.

I found that adding a 20µs delay immediately after the erase resume command avoids the failure. I also tested 10µs and found it to be insufficient. I did not bisect the time further. I have also not explored any similar issue for resumed write operations, because it appears that only kernels doing XIP on MTD parts ever suspend and resume writes. I frankly expect the problem would occur then too, but I'm not set up to do XIP and don't want to propose changes I cannot test.
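
To make the failure mode concrete, here is a minimal user-space sketch of the behaviour described above. It is not the attached patch and not kernel code: struct fake_chip, its fields and the helper functions are hypothetical names invented for illustration, and the 15µs "stale status" window is purely an assumption chosen to sit between the 10µs that failed and the 20µs that worked. Only the 0xD0 resume command and the SR.7 ready bit come from the Intel command set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SR_WSM_READY     0x80u  /* SR.7: write state machine ready */
#define CMD_ERASE_RESUME 0xD0u  /* Intel command set: resume suspended erase */

/* Hypothetical chip model (assumption, not from any data sheet): for
 * resume_latency_us after a resume command, SR.7 still reads "ready"
 * even though the erase has not finished. */
struct fake_chip {
    uint64_t now_us;            /* simulated clock */
    uint64_t resume_at_us;      /* time the resume command was written */
    uint64_t resume_latency_us; /* assumed window in which SR.7 is stale */
    uint64_t erase_left_us;     /* erase time still outstanding */
    bool     erasing;
};

/* Advance the simulated clock; stands in for a busy-wait delay. */
static void chip_delay(struct fake_chip *c, uint64_t us)
{
    c->now_us += us;
}

static void chip_write_cmd(struct fake_chip *c, uint8_t cmd)
{
    if (cmd == CMD_ERASE_RESUME) {
        c->resume_at_us = c->now_us;
        c->erasing = true;
    }
}

static uint8_t chip_read_status(const struct fake_chip *c)
{
    uint64_t elapsed = c->now_us - c->resume_at_us;

    if (!c->erasing)
        return SR_WSM_READY;
    if (elapsed < c->resume_latency_us)
        return SR_WSM_READY;    /* stale: resume not yet acted upon */
    if (elapsed >= c->resume_latency_us + c->erase_left_us)
        return SR_WSM_READY;    /* erase really has finished */
    return 0;                   /* busy */
}

/* Resume the erase, wait settle_us, then sample SR.7 once.
 * Returns true if a driver would conclude the erase is complete. */
static bool resume_and_check(struct fake_chip *c, uint64_t settle_us)
{
    assert(c);
    chip_write_cmd(c, CMD_ERASE_RESUME);
    chip_delay(c, settle_us);
    return (chip_read_status(c) & SR_WSM_READY) != 0;
}
```

With half a second of erase still outstanding, calling resume_and_check() with a settle time of 0 or 10 reports the erase as complete under this model, while a settle time of 20 correctly reports it as busy. The model says nothing about why the real part behaves this way; it only illustrates why a status read placed too close to the resume command can be misread.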

I've included the patch that I am using. It also addresses a few other warts and errata I found while debugging this. If these changes are found to have merit after review I'd be happy for them to be included in mainline. Let me know if I can assist in any way.

Stephan, I hope this helps. Since yours is the only report at all similar to mine I'd be very interested in hearing about your progress.

Best regards,
-Mike Cashwell

-------------- next part --------------
A non-text attachment was scrubbed...
Name: linux-2.6.35.7-001-numonyx-errata.patch
Type: application/octet-stream
Size: 2326 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-mtd/attachments/20110202/f9d719db/attachment.obj>