pxa3xx_nand issues

Mon Sep 27 08:22:37 EDT 2010

On Mon, Sep 27, 2010 at 7:54 PM, pieterg <pieterg at gmx.com> wrote:
> On Sunday 26 September 2010 16:32:47 Lei Wen wrote:
>> On Thu, Sep 23, 2010 at 11:29 PM, pieterg <pieterg at gmx.com> wrote:
>> > On Thursday 23 September 2010 13:32:26 pieterg wrote:
>> >> On Thursday 23 September 2010 08:05:56 Eric Miao wrote:
>> >> > On Thu, Sep 23, 2010 at 1:12 AM, pieterg <pieterg at gmx.com> wrote:
>> >> > > In my search for the cause of the huge number of single/double bit
>> >> > > errors I'm experiencing on colibri pxa320/310 devices, I've come
>> >> > > across this commit
>> >
>> > http://git.kernel.org/?p=linux/kernel/git/ycmiao/pxa-linux-2.6.git;a=commit;h=7f9938d0fd6c778bd0ce296a3e3b50266de2b892
>> >
>> >> > > According to the commitlog, it attempts to work around an issue
>> >> > > regarding non-page-aligned reads.
>> >> > > The workaround seems to force page-aligned access, by dropping the
>> >> > > offset within the page (column address bytes).
>> >> > > However, in my setup (with a jffs2 filesystem on nand),
>> >> > > non-page-aligned reads never occur, but non-page-aligned writes
>> >> > > occur very frequently. (during the jffs2 gc).
>> >> > > These are also affected by this commit, while the commitlog does
>> >> > > not state whether or not the same issue would occur for the
>> >> > > program command, and in that case, whether or not the same
>> >> > > workaround would apply.
>> >> > >
>> >> > > I've tried to revert the commit, but unfortunately this doesn't
>> >> > > reduce the huge number of single/double bit errors (and jffs2 crc
>> >> > > errors as a result) I'm getting.
>> >> > >
>> >> > > But having these non-aligned writes during GC, would that indicate
>> >> > > a problem with my jffs2 image parameters perhaps?
>> >> > > (though I cannot imagine this could actually cause double bit
>> >> > > errors)
>> >> >
>> >> > It might not be related to the commit above.  The NAND controller
>> >> > will always read the whole page and ignoring the column address,
>> >> > that patch tries to make less confusion. The offset is actually
>> >> > handled completely by software (memorized).
>> >>
>> >> I can see how the read offset works, but I do not quite see how this
>> >> would work for writes (which call the same prepare_read_prog_cmd, and
>> >> have their column address stripped as well).
>> >> Found out that this happens when writing oob data by the way; these
>> >> are writes with offset 2048 within the page. Jffs2 does this when
>> >> writing cleanmarkers.
>> >
>> > Tested this, and found out that this commit is actually quite essential
>> > for writes as well.
>> > Without it, the OOB data doesn't get written.
>> > So we can close this part of the topic, commit 7f9938d0 is perfectly
>> > fine.
>>
>> The PXA3xx NAND controller's write semantic is to send a whole page of
>> data to the NAND flash together with the page's address. If you set a
>> value in ndcr1 that is not page aligned, pxa3xx_nand will still send
>> the data to the flash, but according to its spec the NAND flash will
>> not accept this kind of access.
>>
>> Certainly, if you really need to do this, there are still ways. :)
>> Sending the RANDOM DATA INPUT command (0x80 + 5-cycle address + 0x85 +
>> 2-cycle column address) + 0x10 would serve this.
>> But it seems the pxa310 cannot do this; it is supported by the newer
>> silicon in pxa168 or mmp2.
>>
>> >> I could identify about 10 eraseblocks with pages which produce
>> >> single/double bit errors.
>> >> After I marked them bad (manually), I've seen no more bit errors, and
>> >> the jffs2 rootfs has remained perfectly healthy.
>> >
>> > Turned out to be a short-term solution.
>> > After a while I got more double-bit errors, and ended up bad-marking a
>> > dozen or so other eraseblocks, and it does not seem to stop.
>> >
>> > Strangest thing is that when I write a new jffs2 image with uboot (nand
>> > erase, nand write) or with the kernel (flash_eraseall, nandwrite), it
>> > never contains any biterrors when I mount it.
>> > Only after the filesystem has been mounted, gets modified, and then
>> > after the first reboot, the biterrors are there.
>>
>> You may notice that when a new file system is mounted, the flash is
>> only read by the controller.
>> This means your uboot is all right for writing, and your kernel is also
>> OK for reading.
>> Your driver's write function may be broken, though. Timing? Not so sure...
>>
>> > One other issue which I noticed because besides double bit errors I get
>> > many single bit errors as well; the ERR_SBERR is never cleared.
>> > ERR_DBERR is cleared to ERR_NONE in two locations, but ERR_SBERR is
>> > not. (probably in order to allow pxa3xx_nand_ecc_correct to pick it up)
>> > However, I've seen that the retcode could still be ERR_SBERR in
>> > pxa3xx_nand_waitfunc, causing an erase error to be assumed, as a result
>> > all eraseblocks in the partition ended up being marked bad in a loop,
>> > till there were no more remaining eraseblocks.
>> > I guess ERR_SBERR should probably be ignored in pxa3xx_nand_waitfunc?
>>
>> Hm, although ERR_SBERR indicates an error that can be corrected by the
>> NAND controller, it still makes sense to report it to the upper level.
>> A FS like UBIFS could use this information to do flash data integrity
>> maintenance. This is already fixed in the patch set I sent a month ago.
>>
>> > That's what I did in the remainder of my tests (after having unmarked
>> > the blocks that were wrongly marked bad) so I think this issue did not
>> > contribute to my biterror problems.
>>
>> Bit errors may be caused by timing, bad blocks... I think you'd better
>> use the mtd tests built into the linux kernel to make sure the timing
>> is all right.
>
> Which mtd test in particular?
> I've run most tests, without any errors:
>
> -oobtest (complains only about read-past-oob-size not returning a proper
> error)
> -pagetest
> -subpagetest
> -readtest
> -speedtest
>
> Not even a single bit error during any of those tests.

That is so weird...
Was your jffs2 image built correctly? Are the page size and block size set right?
Note that if you write twice to one page without erasing it in between, you
can also see single or double bit errors, and these are not related to bad
blocks or timing. That could explain why all of the mtd tests pass.
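
To illustrate the point about writing twice, here is a minimal sketch in
plain C (not driver code). It assumes the usual NAND property that a program
operation can only change bits from 1 to 0, so a second write to an unerased
page ends up as the AND of both writes. The function name
double_write_bit_errors is made up for this example and does not exist in
pxa3xx_nand or mtd.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Models an erased page (all 0xFF) that is programmed with "first" and
 * then, without an erase, programmed again with "second". Because a NAND
 * program operation can only clear bits, the cells end up holding
 * 0xFF & first & second. The return value is the number of bits in which
 * the stored data differs from what the second write intended.
 */
unsigned double_write_bit_errors(const uint8_t *first,
                                 const uint8_t *second,
                                 size_t len)
{
    unsigned errors = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t cell = 0xFF;        /* erased state */

        cell &= first[i];           /* first program */
        cell &= second[i];          /* second program, no erase between */

        /* count bits that differ from the intended second write */
        uint8_t diff = cell ^ second[i];
        while (diff) {
            errors += diff & 1u;
            diff >>= 1;
        }
    }
    return errors;
}
```

For example, writing 0xF0 and then 0x3C to the same byte leaves 0x30 in the
cell, which differs from 0x3C in two bits. The same applies to the ECC bytes
in the OOB area, so the controller can report single or double bit errors
even though the block and the timing are fine.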

Best regards,
Lei