Numonyx NOR and chip->mutex bug?
mboards at prograde.net
Sun Feb 6 10:49:53 EST 2011
On Feb 6, 2011, at 4:46 AM, Joakim Tjernlund wrote:
>> It's not that doing that causes incorrect bits in the SR but that it disrupts up the 0xe8/count/data... sequence that must happen atomically. That later causes a command sequence error, which is what the status 0xb0 means.
>> So how is an erase thread waking up and not seeing the chip->state to be something other than the FL_ERASING and going back to sleep?
> The only thing I can think of the the earlier discussion about dropping the lock.
Right. Specifically about why an awakened erase thread that should not proceed (because the chip->state is wrong) is doing so anyway.
That's clearly what's happening in Stefan's trace when thread 465 writes 0xe8 and the next write is 0x70 by thread 209. Such a sequence is absolutely illegal (for the flash) and the latter thread is the problem. If we could get a stack trace for that map_write 0x70 I think we'd find the thread awoke and touched the chip without verifying the state first. The question is why.
One last idea.
The whole for(;;) loop in inval_cache_and_wait_for_operation() looks odd to me. Continuing with your idea of moving the chip->state while loop first, I see other problems. It seems to me that anywhere we drop and retake the chip mutex the very next thing needs to be the state check loop. Any break in holding that mutex means we must go back to the top and check state again.
I don't think the code as written does that. I have a completely reordered version of this function. (It didn't fix my issue but I think mine is something else.) On Monday I'll send that to you so you can consider it.
> Oh, one more thing, possibly one needs to add cpu_relax() or similar to force gcc to reload chip->state in the while loop?
I was also wondering about possible gcc optimization issues. I'm on 4.5.2 and that worked for me with the 2003 flash part. The same binaries fail with the 2008 parts, so, I don't know.
Keep in mind that chip->state is not a hardware access. It's just another struct member. And I think that the rules are that any function call counts as a sequence point and gcc isn't allowed to think it knows what the result is and must reload it.
Lastly, consider the direction of the behavior. If we're in the state-check while loop then we got there because the two things were NOT equal. If an optimization error is causing a stale value to be compared wouldn't the observed behavior be that it remains not equal? (If it's not reloaded then how could it change?)
I'd expect an optimization error like that to get us stuck in the while loop, not exit from it prematurely.
Makes me head hurt!
More information about the linux-mtd