Numonyx NOR and chip->mutex bug?

Sun Feb 6 09:55:05 EST 2011

On Feb 6, 2011, at 4:40 AM, Joakim Tjernlund wrote:

> Michael Cashwell <mboards at prograde.net> wrote on 2011/02/05 20:19:59:
>> 
> 
>> It's not that simple. Just waking the suspended erase thread should not cause it to "run" the way you are saying.
>> 
>> That thread *should* wake up in inval_cache_and_wait_for_operation() inside its while (chip->state != chip_state) loop. The purpose of that loop is to see if something else is going on with the chip and if so just go back to sleep. That should prevent what you are saying is happening and if it is
> not then we need to figure out why.
> 
> dropping the lock MIGTH make another thread waiting on that lock run if the scheduler decides so or so I think anyway but I am no expert :)

Nor am I! :-) And yes, that's completely true. But just running shouldn't do anything bad.

>> So what I'm saying is that just dropping the locks and letting other threads run is not, by itself itself, wrong. The code's supposed to do that. It's just that the other threads are supposed check the conditions when they wake up and only proceed if the conditions are correct.
> 
> So we need to know if dropping the lock is guarantied not to schedule some other thread

I think we can know that at least in SMP systems. If you have multiple cores, just dropping the chip->lock would immediately allow other threads to get past taking that same lock. And even on single-processor systems if the thread in question is over its quantum it can get scheduled out. (There may be special rules for kernel threads.)

>> But I do see something suspicious. I think it can be argued that all CFI thread wake-ups need to do the while (chip->state != chip_state) dance and go back to sleep if not.
> 
> Makes sense.
> 
>> That seems to be the case for inval_cache_and_wait_for_operation() but what about the other places where threads can be scheduled out? Won't they just wake up where they are or am I missing some wait queue magic?
>> 
>> IOW: We should look for other places outside of inval_cache_and_wait_for_operation() that call things like msleep, schedule, and cond_resched. (One sneaky one is cfi_udelay as it may call cond_resched.)
>> 
>> If any of those fail to do the chip->state != chip_state sleep loop when they awaken then we have have found our problem.

And I do see places where at least cond_resched() is being called but are not then followed by the chip->state test. But I need to go through it more rigorously. (cfi_udelay() really worries me. It makes it look like udelay() that doesn't schedule but really calls cond_resched() if the delay is large enough. Risky to hide that, IMO.)

>>> I won't worry about this until you test my patch and still have a problem :)
>>> Please test it.
>> 
>> I tested it quickly on Friday and it didn't fix my problem, but we may be fighting different problems.
> 
> Yes, I am getting the feeling that there are 2 different problems.

Yes. So a proper test of your patch on Monday and more digging.  Fun!

-Mike