Numonyx NOR and chip->mutex bug?

Sun Feb 6 04:40:48 EST 2011

Michael Cashwell <mboards at prograde.net> wrote on 2011/02/05 20:19:59:
> On Feb 5, 2011, at 5:29 AM, Joakim Tjernlund wrote:
>
> > Michael Cashwell <mboards at prograde.net> wrote on 2011/02/04 18:09:01:
> >>
> >> That's where I get confused. It's true that another thread could do that but wouldn't it have to bracket its operation with calls to get_chip() and put_chip()? Those actually do the suspend and resume. I'm certainly seeing erases being suspended to do a buffered write and then being resumed.
> >
> > buffered write do get_chip that suspends the erase, then a WAIT_TIMEOUT which drops the locks and gives the suspended erase a chance to wakeup and run but now SR is holding the status for the write so the suspended erase are testing the wrong status, do you agree?
>
> It's not that simple. Just waking the suspended erase thread should not cause it to "run" the way you are saying.
>
> That thread *should* wake up in inval_cache_and_wait_for_operation() inside its while (chip->state != chip_state) loop. The purpose of that loop is to see if something else is going on with the chip and if so just go back to sleep. That should prevent what you are saying is happening and if it is
not then we need to figure out why.

dropping the lock MIGTH make another thread waiting on that lock run if the scheduler decides so
or so I think anyway but I am no expert :)

>
> In the case of a buffered write the erase thread should see chip->state as either FL_WRITING_TO_BUFFER (if after the 0xE8) or FL_WRITING (if after the 0xD0). Since those don't match the expected FL_ERASING state the thread should stay in the while loop and go back to sleep on the wait queue. It
should keep doing that until the chip->state has come back to FL_ERASING. That should not happen until the write function is finished and does its put_chip and erase resume.
>
> So what I'm saying is that just dropping the locks and letting other threads run is not, by itself itself, wrong. The code's supposed to do that. It's just that the other threads are supposed check the conditions when they wake up and only proceed if the conditions are correct.

So we need to know if dropping the lock is guarantied not to schedule some other thread

>
> But I do see something suspicious. I think it can be argued that all CFI thread wake-ups need to do the while (chip->state != chip_state) dance and go back to sleep if not.

Makes sense.

>
> That seems to be the case for inval_cache_and_wait_for_operation() but what about the other places where threads can be scheduled out? Won't they just wake up where they are or am I missing some wait queue magic?
>
> IOW: We should look for other places outside of inval_cache_and_wait_for_operation() that call things like msleep, schedule, and cond_resched. (One sneaky one is cfi_udelay as it may call cond_resched.)
>
> If any of those fail to do the chip->state != chip_state sleep loop when they awaken then we have have found our problem.
>

[SNIP]

>
> > I won't worry about this until you test my patch and still have a problem :)
> > Please test it.
>
> I tested it quickly on Friday and it didn't fix my problem, but we may be fighting different problems.

Yes, I am getting the feeling that there are 2 different problems.

>
> I also can't be sure the .c file didn't have other changes. I was really pressed for time that day.
>
> My plan is to retest your patch against a pristine copy of the file and report back. Unfortunately this has to wait until Monday as I don't have access to the hardware over the weekend.
>
> -Mike
>