Numonyx NOR and chip->mutex bug?

Joakim Tjernlund joakim.tjernlund at transmode.se
Tue Jan 25 13:56:01 EST 2011


>
> Greetings All.
>
> I've been working on a 2.6.35.7-based ARM kernel for the Gumstix VerdexPro-XL6P hardware. (Our software; their hardware.) Previous boards had a Numonyx 256Mbit NOR flash marked 2003. The latest batch has an F in the part number (indicating a change from 130nm to 65nm) and a 2008 copyright. (The part is a 256PF30TF).
>
> With this new part I'm seeing MTD errors that I think I've traced to cfi_cmdset_0001.c that I'd like to ask about.
>
> The error manifests when I write hard to a UBIFS file system on this NOR flash. What I see is a "NOR Flash: buffer write error" and then either "(block locked)" or "(Bad VPP)"

I think this is a chip hardware error. These chips have some strange errors, so you
had better check the errata for your chip. We have seen similar problems with these
newer 65nm Numonyx chips.

>
> Through a lot of tracing (which I can describe if anyone cares) I've determined that the problem is that while waiting for the write-completion command to finish the part mysteriously goes back into array mode. (The variations in the error message above stem from array data being seen as status bits.)
>
> But how could that happen given the use of the chip->mutex? I think the answer is that the chip->mutex code is broken.
>
> The (non-XIP) wait function inval_cache_and_wait_for_operation() that waits for completion of the buffer program confirm command has several places where it drops and then retakes the chip->mutex. While dropped it does various cache-flush, msleep(), cond_resched(), and schedule() calls. Exactly which of these it calls depends on the absolute length of the operation underway.
>
> Interestingly, this new FLASH part has a write buffer of 512 words while the previous part was 32 words. Thus the write times (and time outs) have also increased by a similar x16 factor. I think this is why this has not been seen before.

Shouldn't the write time be about the same? Otherwise, what is the point of a bigger buffer?

>
> I also think UBI is important in seeing the problem. If I just flash_erase and dd into an MTD partition it works fine. But those programs are single-threaded where improper mutex use won't matter. By contrast, UBI is heavily multi-threaded.
>
> But my confusion is how can dropping that mutex ever be correct? Isn't its purpose to prevent other threads from calling down into the MTD code for the same devices while a lengthy operation is underway? How can it ever be correct to then allow such concurrency in the middle of the operation?
>
> If I comment out all of those unlock/lock calls on chip->mutex in inval_cache_and_wait_for_operation() all the errors vanish. In fact, everything seems to work fine though I'm not doing suspend/resume.
>
> Am I wildly confused in all this? When is dropping the chip->mutex while waiting for lengthy commands needed? Scheduling

When you want to suspend an erase to do a read, for example. You don't want to be without
erase suspend, trust me :)

> while holding a spinlock is bad, but we're not dealing with a spinlock but rather a mutex.
>
> Input welcome.

I think a locking problem is unlikely. You only need to hold the lock when testing or
changing chip->state.
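
To illustrate the point about chip->state and erase suspend, here is a minimal
user-space sketch in C. It is not the code from cfi_cmdset_0001.c, only a
reduced model under stated assumptions: the mutex is held just long enough to
test and change the state, and the state value itself is what keeps other
callers off the chip while the mutex is dropped. The FL_* names follow the
driver's naming, but claim_chip(), release_chip() and run_demo() are names made
up for this example. The real driver also talks to the hardware (suspend and
resume commands) and lets a busy caller sleep and retry, where this sketch
simply returns false.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Simplified stand-ins for the driver's chip states (assumption). */
enum chip_state { FL_READY, FL_READING, FL_WRITING, FL_ERASING };

struct chip {
    pthread_mutex_t mutex;     /* models chip->mutex */
    enum chip_state state;     /* models chip->state */
    enum chip_state oldstate;  /* state to restore on release */
};

/* Hypothetical helper: try to claim the chip for operation `want`.
 * The mutex is held only while state is tested and changed. */
static bool claim_chip(struct chip *c, enum chip_state want)
{
    bool ok = false;

    pthread_mutex_lock(&c->mutex);
    if (c->state == FL_READY) {
        c->oldstate = FL_READY;
        c->state = want;
        ok = true;
    } else if (c->state == FL_ERASING && want == FL_READING) {
        /* Erase suspend: real hardware would get a suspend command here. */
        c->oldstate = FL_ERASING;
        c->state = FL_READING;
        ok = true;
    }
    pthread_mutex_unlock(&c->mutex);
    return ok;
}

/* Hypothetical helper: give the chip back, resuming a suspended erase. */
static void release_chip(struct chip *c)
{
    pthread_mutex_lock(&c->mutex);
    c->state = c->oldstate;
    c->oldstate = FL_READY;
    pthread_mutex_unlock(&c->mutex);
}

/* Walks through two scenarios, with the callers simulated sequentially. */
int run_demo(void)
{
    struct chip c;

    pthread_mutex_init(&c.mutex, NULL);
    c.state = FL_READY;
    c.oldstate = FL_READY;

    /* A writer claims the chip, then waits with the mutex dropped. */
    assert(claim_chip(&c, FL_WRITING));
    /* Other callers can take the mutex, but the state turns them away. */
    assert(!claim_chip(&c, FL_WRITING));
    assert(!claim_chip(&c, FL_READING));
    release_chip(&c);
    assert(c.state == FL_READY);

    /* An eraser claims the chip; a reader may suspend the erase. */
    assert(claim_chip(&c, FL_ERASING));
    assert(claim_chip(&c, FL_READING));
    assert(!claim_chip(&c, FL_WRITING));  /* writes still refused */
    release_chip(&c);                     /* reader done, erase resumes */
    assert(c.state == FL_ERASING);
    release_chip(&c);                     /* erase done */
    assert(c.state == FL_READY);

    pthread_mutex_destroy(&c.mutex);
    return 0;
}
```

Calling run_demo() from a main() exercises both cases. In this model, dropping
the mutex during a long wait is harmless as long as every path that touches the
chip goes through the state check first.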



