Numonyx NOR and chip->mutex bug?
Joakim Tjernlund
joakim.tjernlund at transmode.se
Sat Feb 5 05:29:17 EST 2011
Michael Cashwell <mboards at prograde.net> wrote on 2011/02/04 18:09:01:
>
> On Feb 4, 2011, at 8:05 AM, Joakim Tjernlund wrote:
>
> > Stefan,
> >
> > I think the bug lies in
> > static int inval_cache_and_wait_for_operation(
> > struct map_info *map, struct flchip *chip,
> > unsigned long cmd_adr, unsigned long inval_adr, int inval_len,
> > unsigned int chip_op_time, unsigned int chip_op_time_max)
> > {
> > struct cfi_private *cfi = map->fldrv_priv;
> > map_word status, status_OK = CMD(0x80);
> > int chip_state = chip->state;
> > unsigned int timeo, sleep_time, reset_timeo;
> >
> > mutex_unlock(&chip->mutex);
> > if (inval_len)
> > INVALIDATE_CACHED_RANGE(map, inval_adr, inval_len);
> > mutex_lock(&chip->mutex);
> >
> > Here we drop the lock and take it again. Someone may suspend the
> > erase and do a read/write instead and put the chip some other state.
>
> That's where I get confused. It's true that another thread could do that but wouldn't it have to bracket its operation with calls to get_chip() and put_chip()? Those actually do the suspend and resume. I'm certainly seeing erases being suspended to do a buffered write and then being resumed.
buffered write do get_chip that suspends the erase, then a WAIT_TIMEOUT which drops the locks and
gives the suspended erase a chance to wakeup and run but now SR is holding the status
for the write so the suspended erase are testing the wrong status, do you agree?
>
> A properly done suspend/resume should leave the hardware is the same state it was in before hand. So if get_chip/put_chip are not doing that then that's where the problem is.
See my comment above, I don't think inval_cache_and_wait_for_operation is doing the right thing.
>
> Originally I thought a udelay(20) was needed after the resume command was sent. What I've most recently seen is that just doing one map_read() of status (and discarding the result) is enough to avoid my error:
>
> switch(chip->oldstate) {
> case FL_ERASING:
> chip->state = chip->oldstate;
> /* What if one interleaved chip has finished and the
> other hasn't? The old code would leave the finished
> one in READY mode. That's bad, and caused -EROFS
> errors to be returned from do_erase_oneblock because
> that's the only bit it checked for at the time.
> As the state machine appears to explicitly allow
> sending the 0x70 (Read Status) command to an erasing
> chip and expecting it to be ignored, that's what we
> do. */
> map_write(map, CMD(0xd0), adr);
> map_write(map, CMD(0x70), adr);
> (void) map_read(map, adr); // <<<<----- added
> chip->oldstate = FL_READY;
> chip->state = FL_ERASING;
> break;
>
> I'm still not sure why. My guess is that it's timing (since udelay(20) in the same spot also works for me) but being timing-sensitive is making it hard to pin down. Adding printk()s also often makes it not happen.
I won't worry about this until you test my patch and still have a problem :)
Please test it.
Jocke
More information about the linux-mtd
mailing list