Numonyx NOR and chip->mutex bug?

Sat Feb 5 05:29:17 EST 2011

Michael Cashwell <mboards at prograde.net> wrote on 2011/02/04 18:09:01:
>
> On Feb 4, 2011, at 8:05 AM, Joakim Tjernlund wrote:
>
> > Stefan,
> >
> > I think the bug lies in
> > static int inval_cache_and_wait_for_operation(
> >       struct map_info *map, struct flchip *chip,
> >       unsigned long cmd_adr, unsigned long inval_adr, int inval_len,
> >       unsigned int chip_op_time, unsigned int chip_op_time_max)
> > {
> >    struct cfi_private *cfi = map->fldrv_priv;
> >    map_word status, status_OK = CMD(0x80);
> >    int chip_state = chip->state;
> >    unsigned int timeo, sleep_time, reset_timeo;
> >
> >    mutex_unlock(&chip->mutex);
> >    if (inval_len)
> >       INVALIDATE_CACHED_RANGE(map, inval_adr, inval_len);
> >    mutex_lock(&chip->mutex);
> >
> > Here we drop the lock and take it again. Someone may suspend the
> > erase and do a read/write instead and put the chip some other state.
>
> That's where I get confused. It's true that another thread could do that but wouldn't it have to bracket its operation with calls to get_chip() and put_chip()? Those actually do the suspend and resume. I'm certainly seeing erases being suspended to do a buffered write and then being resumed.

buffered write do get_chip that suspends the erase, then a WAIT_TIMEOUT which drops the locks and
gives the suspended erase a chance to wakeup and run but now SR is holding the status
for the write so the suspended erase are testing the wrong status, do you agree?

>
> A properly done suspend/resume should leave the hardware is the same state it was in before hand. So if get_chip/put_chip are not doing that then that's where the problem is.

See my comment above, I don't think inval_cache_and_wait_for_operation is doing the right thing.

>
> Originally I thought a udelay(20) was needed after the resume command was sent. What I've most recently seen is that just doing one map_read() of status (and discarding the result) is enough to avoid my error:
>
>    switch(chip->oldstate) {
>    case FL_ERASING:
>       chip->state = chip->oldstate;
>       /* What if one interleaved chip has finished and the
>          other hasn't? The old code would leave the finished
>          one in READY mode. That's bad, and caused -EROFS
>          errors to be returned from do_erase_oneblock because
>          that's the only bit it checked for at the time.
>          As the state machine appears to explicitly allow
>          sending the 0x70 (Read Status) command to an erasing
>          chip and expecting it to be ignored, that's what we
>          do. */
>       map_write(map, CMD(0xd0), adr);
>       map_write(map, CMD(0x70), adr);
>       (void) map_read(map, adr);   // <<<<----- added
>       chip->oldstate = FL_READY;
>       chip->state = FL_ERASING;
>       break;
>
> I'm still not sure why. My guess is that it's timing (since udelay(20) in the same spot also works for me) but being timing-sensitive is making it hard to pin down. Adding printk()s also often makes it not happen.

I won't worry about this until you test my patch and still have a problem :)
Please test it.

   Jocke