Numonyx NOR and chip->mutex bug?

Sat Feb 5 14:19:59 EST 2011

On Feb 5, 2011, at 5:29 AM, Joakim Tjernlund wrote:

> Michael Cashwell <mboards at prograde.net> wrote on 2011/02/04 18:09:01:
>> 
>> That's where I get confused. It's true that another thread could do that but wouldn't it have to bracket its operation with calls to get_chip() and put_chip()? Those actually do the suspend and resume. I'm certainly seeing erases being suspended to do a buffered write and then being resumed.
> 
> buffered write do get_chip that suspends the erase, then a WAIT_TIMEOUT which drops the locks and gives the suspended erase a chance to wakeup and run but now SR is holding the status for the write so the suspended erase are testing the wrong status, do you agree?

It's not that simple. Just waking the suspended erase thread should not cause it to "run" the way you are saying.

That thread *should* wake up in inval_cache_and_wait_for_operation() inside its while (chip->state != chip_state) loop. The purpose of that loop is to see if something else is going on with the chip and if so just go back to sleep. That should prevent what you are saying is happening and if it is not then we need to figure out why.

In the case of a buffered write the erase thread should see chip->state as either FL_WRITING_TO_BUFFER (if after the 0xE8) or FL_WRITING (if after the 0xD0). Since those don't match the expected FL_ERASING state the thread should stay in the while loop and go back to sleep on the wait queue. It should keep doing that until the chip->state has come back to FL_ERASING. That should not happen until the write function is finished and does its put_chip and erase resume.

So what I'm saying is that just dropping the locks and letting other threads run is not, by itself itself, wrong. The code's supposed to do that. It's just that the other threads are supposed check the conditions when they wake up and only proceed if the conditions are correct.

But I do see something suspicious. I think it can be argued that all CFI thread wake-ups need to do the while (chip->state != chip_state) dance and go back to sleep if not.

That seems to be the case for inval_cache_and_wait_for_operation() but what about the other places where threads can be scheduled out? Won't they just wake up where they are or am I missing some wait queue magic?

IOW: We should look for other places outside of inval_cache_and_wait_for_operation() that call things like msleep, schedule, and cond_resched. (One sneaky one is cfi_udelay as it may call cond_resched.)

If any of those fail to do the chip->state != chip_state sleep loop when they awaken then we have have found our problem.

One last idea is to add more info in your logging code:

ubimkvol /dev/ubi0 -s 6MiB -N test1
[2299][209] do_erase_oneblock start adr=0x00020000 len=0x20000
[2299][209] map_write 0x50 to 0x00020000
[2299][209] map_write 0x20 to 0x00020000
[2299][209] map_write 0xd0 to 0x00020000
[2299][465] map_write 0xb0 to 0x03fe5000
[2307][465] erase suspend 1         adr=0x03fe5000
[2307][465] map_write 0x70 to 0x03fe5000
[2307][465] map_write 0xe8 to 0x03fe5000
[2307][209] map_write 0x70 to 0x00020000
[2307][209] map_write 0x50 to 0x00020000
[2307][209] map_write 0xd0 to 0x00020000
[2307][209] map_write 0x70 to 0x00020000
[2311][209] erase resumed 2b        adr=0x00020000
[2319][209] do_erase_oneblock end   adr=0x00020000 len=0x20000
[2319][465] map_write 0x1ff to 0x03fe5000
[2319][465] map_write 0xc03c0000 to 0x03fe5000
[2319][465] map_write 0xc03c0000 to 0x03fe5002
...
[2327][465] map_write 0xc03c0000 to 0x03fe53fc
[2327][465] map_write 0xc03c0000 to 0x03fe53fe
[2327][465] map_write 0xd0 to 0x03fe5000
[3387][465] map_write 0x50 to 0x03fe5000
[3387][465] map_write 0x70 to 0x03fe5000
[3395][465] 50000000.flash: buffer write error status 0xb0 adr=0x03fe5400 len=0x0 (0x400)

when "erase resumed" happens, it should be inside the put_chip() done when the write command is finished. Something to confirm.

It would be helpful if the log also showed start/end for do_buffer_write() and would print chip->state at various points since that value controls what threads do when they wake up.

> I won't worry about this until you test my patch and still have a problem :)
> Please test it.

I tested it quickly on Friday and it didn't fix my problem, but we may be fighting different problems.

I also can't be sure the .c file didn't have other changes. I was really pressed for time that day.

My plan is to retest your patch against a pristine copy of the file and report back. Unfortunately this has to wait until Monday as I don't have access to the hardware over the weekend.

-Mike