Fw: corrupt my NAND flash device
Charles Manning
manningc2 at actrix.gen.nz
Mon Apr 28 17:14:42 EDT 2003
I have seen some weird stuff before... comments further below:
>
> The whole thing just makes me sick. It's ugly putting in such a hack.
> One little voice in my head keeps telling me that there's an error in
> software and I just have to find and fix the bug. Another little voice
> in my head keeps telling me that broken hardware is more common than
> most people want to believe.
Yes, there are (or have been) cases where chips do not latch their commands
correctly. This can be made worse by marginal chip select timing etc.
I was sent some errata sheets by Samsung at some stage, but I did not secure
permission to forward these. In all cases, the identified problems have been
addressed in currently shipping product. To paraphrase the mentioned problems:
* Reading the status too soon after issuing the command: some parts need a
brief wait after latching the command before the busy flag is valid. Without
the wait, the busy state might be misinterpreted. 500ns would be ample.
* Ensuring the correct number of address cycles: I have observed cases where
a chip seemed to work when the wrong number of address cycles was issued, but
gave erratic results.
* Issuing a reset command before any read/write/erase command: this is a
small overhead and ensures that the command register is always in a
consistent state.
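
The three points above can be sketched in C roughly as follows. This is only
an illustration against a simulated bus, not code from any real driver: the
hook names (nand_write_cmd, nand_write_addr, nand_delay_ns, nand_is_busy),
the poll limit and the bus log are all made up for the example. The opcodes
are the conventional NAND ones (0xFF reset, 0x60/0xD0 block erase), and the
number of row address cycles is assumed to come from the part's data sheet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Conventional NAND opcodes. */
#define CMD_RESET  0xFF
#define CMD_ERASE1 0x60
#define CMD_ERASE2 0xD0

#define SETTLE_NS  500   /* wait before trusting the busy flag */
#define MAX_POLLS  1000  /* hypothetical upper bound on busy polls */

/* Simulated bus: records what would be driven onto the NAND pins. */
enum bus_op { OP_CMD, OP_ADDR, OP_DELAY_NS, OP_POLL_BUSY };
struct bus_event { enum bus_op op; uint32_t value; };

#define BUS_LOG_MAX 64
static struct bus_event bus_log[BUS_LOG_MAX];
static size_t bus_log_len;
static int sim_busy_polls_left;  /* simulated chip stays busy this long */

static void bus_record(enum bus_op op, uint32_t value)
{
    assert(bus_log_len < BUS_LOG_MAX);
    bus_log[bus_log_len].op = op;
    bus_log[bus_log_len].value = value;
    bus_log_len++;
}

/* Hypothetical hardware hooks; a real port would touch registers here. */
static void nand_write_cmd(uint8_t cmd)   { bus_record(OP_CMD, cmd); }
static void nand_write_addr(uint8_t addr) { bus_record(OP_ADDR, addr); }
static void nand_delay_ns(uint32_t ns)    { bus_record(OP_DELAY_NS, ns); }

static int nand_is_busy(void)
{
    bus_record(OP_POLL_BUSY, 0);
    if (sim_busy_polls_left > 0) {
        sim_busy_polls_left--;
        return 1;
    }
    return 0;
}

/* Wait a little before the first poll, then poll a bounded number of times. */
static int nand_wait_ready(void)
{
    nand_delay_ns(SETTLE_NS);
    for (int i = 0; i < MAX_POLLS; i++)
        if (!nand_is_busy())
            return 0;
    return -1;  /* chip never reported ready */
}

/* Erase one block: reset first, then exactly row_cycles address cycles. */
static int nand_erase_block(uint32_t row, int row_cycles)
{
    assert(row_cycles >= 1 && row_cycles <= 4);

    nand_write_cmd(CMD_RESET);           /* known command-register state */
    if (nand_wait_ready() != 0)
        return -1;

    nand_write_cmd(CMD_ERASE1);
    for (int i = 0; i < row_cycles; i++) /* low byte first */
        nand_write_addr((uint8_t)(row >> (8 * i)));
    nand_write_cmd(CMD_ERASE2);

    return nand_wait_ready();
}
```

Calling nand_erase_block(0x012345, 3) on the simulated bus logs a reset, a
500ns delay, busy polls, then 0x60, the three address bytes 0x45 0x23 0x01,
0xD0, another delay and a final poll. Keeping the cycle count as an explicit
parameter makes a mismatch with the data sheet easy to spot in review.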
Also check the basics like power and signal integrity. Overshooting/ringing
clocks could very easily be latching spurious data and corrupting the
commands.
>
> I haven't been very aggressive about adding the retry code because right
> now I'm interested in more data points: Am I the only one that sees the
> problem of a flash chip that occasionally drops commands or are others
> seeing this same problem? Is this problem more common but people don't
> see it because the flash filesystems think that a location is bad and
> mark it as unusable?
I'd suggest exploring the above first.
YAFFS is very aggressive about the way it retires data blocks. If any read
(including verification) shows any corruption (even if the ECC fixes it),
then the block is retired. The reason for this strategy is that I have a
theory that blocks go bad with age/use. Rather than encountering
unrecoverable data errors, YAFFS retires blocks on the first sign of a
problem.
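
The retirement policy described above might look something like the
following C sketch. This is not the actual YAFFS code; the enum, the struct
and the function name are hypothetical and exist only to show the rule
"retire on any ECC event, corrected or not".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of an ECC-checked page read. */
enum ecc_result {
    ECC_NO_ERROR,       /* data read back clean */
    ECC_CORRECTED,      /* error found, ECC repaired it */
    ECC_UNCORRECTABLE   /* error found, data lost */
};

/* Hypothetical per-block bookkeeping. */
struct block_state {
    bool retired;
    unsigned ecc_events;  /* reads that needed or failed correction */
};

/*
 * Record the result of one read. Any ECC event at all, even one that
 * was corrected, retires the block. Returns true if the block is
 * retired after this call.
 */
static bool block_note_read(struct block_state *b, enum ecc_result r)
{
    assert(b != NULL);
    if (r != ECC_NO_ERROR) {
        b->ecc_events++;
        b->retired = true;
    }
    return b->retired;
}
```

A more tolerant policy would only retire on ECC_UNCORRECTABLE, or after
several corrected reads; the sketch shows the aggressive variant because
that is the one described here.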
Someone has previously suggested that I provide a flag to disable on-NAND
retirement marking during development. I think it is time I added this.
I guess it would be a good thing to do retries in YAFFS to at least get more
information.
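
A retry wrapper that also collects statistics could be sketched in C as
below. Again this is only an assumption about how it might be done, not
YAFFS code: the callback type, the stats struct and the flaky test device
are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page read hook: returns 0 on a clean read, nonzero on error. */
typedef int (*page_read_fn)(void *ctx, unsigned page);

/* Counters kept so that intermittent failures become visible. */
struct retry_stats {
    unsigned attempts;
    unsigned failures;
};

/* Try the read up to max_attempts times, counting every attempt. */
static int read_with_retry(page_read_fn read_page, void *ctx, unsigned page,
                           unsigned max_attempts, struct retry_stats *stats)
{
    int rc = -1;

    assert(read_page != NULL && stats != NULL && max_attempts > 0);
    for (unsigned i = 0; i < max_attempts; i++) {
        stats->attempts++;
        rc = read_page(ctx, page);
        if (rc == 0)
            break;          /* clean read, stop retrying */
        stats->failures++;
    }
    return rc;
}

/* Fake device for experiments: fails the first fail_count reads. */
struct flaky_dev {
    unsigned fail_count;
};

static int flaky_read(void *ctx, unsigned page)
{
    struct flaky_dev *dev = ctx;

    (void)page;
    if (dev->fail_count > 0) {
        dev->fail_count--;
        return -1;
    }
    return 0;
}
```

A read that succeeds only on the second or third attempt points at dropped
commands or marginal timing rather than a genuinely bad block, which is the
kind of data point being asked for in this thread.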
-- Charles