Numonyx NOR and chip->mutex bug?
mboards at prograde.net
Fri Feb 4 12:09:01 EST 2011
On Feb 4, 2011, at 8:05 AM, Joakim Tjernlund wrote:
> I think the bug lies in
> static int inval_cache_and_wait_for_operation(
> struct map_info *map, struct flchip *chip,
> unsigned long cmd_adr, unsigned long inval_adr, int inval_len,
> unsigned int chip_op_time, unsigned int chip_op_time_max)
> struct cfi_private *cfi = map->fldrv_priv;
> map_word status, status_OK = CMD(0x80);
> int chip_state = chip->state;
> unsigned int timeo, sleep_time, reset_timeo;
> if (inval_len)
> INVALIDATE_CACHED_RANGE(map, inval_adr, inval_len);
> Here we drop the lock and take it again. Someone may suspend the
> erase and do a read/write instead and put the chip some other state.
That's where I get confused. It's true that another thread could do that but wouldn't it have to bracket its operation with calls to get_chip() and put_chip()? Those actually do the suspend and resume. I'm certainly seeing erases being suspended to do a buffered write and then being resumed.
A properly done suspend/resume should leave the hardware is the same state it was in before hand. So if get_chip/put_chip are not doing that then that's where the problem is.
Originally I thought a udelay(20) was needed after the resume command was sent. What I've most recently seen is that just doing one map_read() of status (and discarding the result) is enough to avoid my error:
chip->state = chip->oldstate;
/* What if one interleaved chip has finished and the
other hasn't? The old code would leave the finished
one in READY mode. That's bad, and caused -EROFS
errors to be returned from do_erase_oneblock because
that's the only bit it checked for at the time.
As the state machine appears to explicitly allow
sending the 0x70 (Read Status) command to an erasing
chip and expecting it to be ignored, that's what we
map_write(map, CMD(0xd0), adr);
map_write(map, CMD(0x70), adr);
(void) map_read(map, adr); // <<<<----- added
chip->oldstate = FL_READY;
chip->state = FL_ERASING;
I'm still not sure why. My guess is that it's timing (since udelay(20) in the same spot also works for me) but being timing-sensitive is making it hard to pin down. Adding printk()s also often makes it not happen.
More information about the linux-mtd