Testing generic empty page bit flips recovery

Boris Brezillon boris.brezillon at free-electrons.com
Wed Dec 30 09:53:48 PST 2015


On Wed, 30 Dec 2015 11:45:38 -0600
"Franklin S Cooper Jr." <fcooper at ti.com> wrote:

> 
> 
> On 12/30/2015 10:59 AM, Boris Brezillon wrote:
> > On Wed, 30 Dec 2015 10:40:49 -0600
> > "Franklin S Cooper Jr." <fcooper at ti.com> wrote:
> >
> >>
> >> On 12/30/2015 10:02 AM, Boris Brezillon wrote:
> >>> On Wed, 30 Dec 2015 09:33:52 -0600
> >>> "Franklin S Cooper Jr." <fcooper at ti.com> wrote:
> >>>
> >>>> On 12/30/2015 08:40 AM, Boris Brezillon wrote:
> >>>>> Hi Franklin,
> >>>>>
> >>>>> On Wed, 30 Dec 2015 08:10:20 -0600
> >>>>> "Franklin S Cooper Jr." <fcooper at ti.com> wrote:
> >>>>>
> >>>>>> I am following up on the discussion of this patch set
> >>>>>> (https://patchwork.ozlabs.org/patch/539059/), which
> >>>>>> suggested that Michael instead test the generic bitflip
> >>>>>> recovery implemented by Boris's "mtd: nand: properly
> >>>>>> handle bitflips in erased pages" patchset
> >>>>>> (http://lists.infradead.org/pipermail/linux-mtd/2015-September/061617.html).
> >>>>>> I would like to test Boris's patchset, but first I need to
> >>>>>> reproduce the error that it fixes.
> >>>>>>
> >>>>>> The error that the patchset is attempting to fix isn't
> >>>>>> something I have ever encountered before. Currently I am
> >>>>>> trying to reproduce this issue on a TI K2E evm that uses the
> >>>>>> davinci nand driver. I flashed the nand's file-system
> >>>>>> partition with a ubi filesystem and the board is currently
> >>>>>> set to boot using the file-system on the nand. After about
> >>>>>> 60 secs I cut the power from the board and boot the board
> >>>>>> again. I would expect the board to eventually fail to mount
> >>>>>> the ubi filesystem, but it has now run for over 24 hours,
> >>>>>> been powered off and on over 1400 times, and is still
> >>>>>> mounting the file-system perfectly fine.
> >>>>>>
> >>>>>> Any suggestions on a test case that I can use to force the
> >>>>>> empty page bit flips error?
> >>>>>>
> >>>>>>
> >>>>> The davinci driver seems to support raw accesses, so you can try to
> >>>>> apply this patch [1] against the mtd-utils tree (not sure it still
> >>>>> applies cleanly, but it should work with mtd-utils-1.5.1), and use the
> >>>>> nandflipbits tool:
> >>>>>
> >>>>> # flash_erase /dev/mtdX <offset> 1
> >>>>> # nandflipbits /dev/mtdX 1@<offset>
> >>>>> # nanddump -f /tmp/dump -s <offset> -l <page-size> /dev/mtdX
> >>>>>
> >>>>> Without the patch, nanddump should complain about uncorrectable errors,
> >>>>> and if you hexdump /tmp/dump you should see the bitflip.
> >>>>> If nanddump does not complain after applying my patch, then it means it
> >>>>> fixes the "bitflips in erased pages" bug.
> >>>>>
> >>>>> Best Regards,
> >>>>>
> >>>>> Boris
> >>>>>
> >>>>> [1]http://lists.infradead.org/pipermail/linux-mtd/2014-November/056634.html
> >>>> Hi Boris,
> >>>>
> >>>> Thanks for the quick reply. I built mtd-utils with your
> >>>> patch and ran the suggested commands on a 4.1 based kernel
> >>>> without your kernel patchset and I didn't see your expected
> >>>> output. The 4.1 based kernel hasn't had any changes to
> >>>> davinci_nand or nand subsystem that would address this
> >>>> bitflip error.
> >>>>
> >>>> I'm currently going to attempt to run the same test on the
> >>>> latest mainline.
> >>>>
> >>>> Here is the output I received when I ran your suggested
> >>>> commands on the 4.1 based kernel.
> >>>> root@k2e-evm:~# ./flash_erase /dev/mtd4 4096 1
> >>>> Erasing 128 Kibyte @ 0 -- 100 % complete
> >>>> root@k2e-evm:~# ./nandflipbits /dev/mtd4 1@4096
> >>>> root@k2e-evm:~# ./nanddump -f /tmp/dump -s 4096 -l 2048 /dev/mtd4
> >>>> ECC failed: 0
> >>>> ECC corrected: 0
> >>>> Number of bad blocks: 0
> >>>> Number of bbt blocks: 4
> >>>> Block size 131072, page size 2048, OOB size 64
> >>>> root@k2e-evm:~# hexdump /tmp/dump
> >>>> 0000000 fffd ffff ffff ffff ffff ffff ffff ffff
> >>>> 0000010 ffff ffff ffff ffff ffff ffff ffff ffff
> >>>> *
> >>>> 0000800
> >>>>
> >>>> Any thoughts on why I'm not seeing the expected error?
> >>>>
> >>> Oh, actually this behavior is explained in the commit message:
> >>>
> >>> "Currently empty page bit flips are not corrected and report 0 errors."
> >>>
> >>> Which explains why you're seeing the bitflip in the dump, but nothing
> >>> reported by the MTD layer.
> >>>
> >>> After applying my patch, the bitflip should simply disappear. You can
> >>> then try to generate more bitflips than the engine can actually fix
> >>> (nandflipbits /dev/mtd4 1@0:5@0:49@0:98@0:132@0) and check that MTD
> >>> reports an uncorrectable error.
> >> I verified that I am indeed using ecc4bit mode.
> >>
> >> I attempted to run the series of nandflipbits flips as you
> >> suggested but I get an "Invalid bit description" error from the
> >> utility. For some reason I can only use the nandflipbits
> >> utility for bits 1-7. Anything higher and I get the "Invalid
> >> bit description" error.
> > Indeed. I developed that tool a long time ago and didn't remember that
> > the bit field is encoding the bit offset within a byte. This command
> > should work.
> >
> > nandflipbits /dev/mtd4 1@0:5@0:7@30:3@46:5@47
> >
> >> On the latest master commit I ran nandflipbits for bits 1-7
> >> at address 0. However, I still didn't receive any error from
> >> nanddump, although I do see the flipped bits in the hexdump
> >> /tmp/dump output.
> > How many of them do you see?
> >
> >> I then applied your patchset on top of the latest mainline
> >> and ran nandflipbits for bits 1-7 at address 0.
> >> I get the output below, which seems to be correct.
> >>
> >> root@k2e-evm:~# ./nandflipbits /dev/mtd4 1@0
> >> root@k2e-evm:~# ./nandflipbits /dev/mtd4 2@0
> >> root@k2e-evm:~# ./nandflipbits /dev/mtd4 3@0
> >> root@k2e-evm:~# ./nandflipbits /dev/mtd4 4@0
> >> root@k2e-evm:~# ./nandflipbits /dev/mtd4 5@0
> >> root@k2e-evm:~# ./nandflipbits /dev/mtd4 6@0
> >> root@k2e-evm:~# ./nandflipbits /dev/mtd4 7@0
> >> root@k2e-evm:~# ./nanddump -f /tmp/dump -s 0 -l 2048 /dev/mtd4
> >>
> >> ECC failed: 1
> >> ECC corrected: 18
> >> Number of bad blocks: 0
> >> Number of bbt blocks: 4
> >> Block size 131072, page size 2048, OOB size 64
> >> Dumping data starting at 0x00000000 and ending at 0x00000800...
> >> ECC: 4 corrected bitflip(s) at offset 0x00000000
> >> root@k2e-evm:~# hexdump /tmp/dump
> >> 0000000 ffff ffff ffff ffff ffff ffff ffff ffff
> >> *
> >> 0000800
> > Hm, that's weird. You should get an ECC failure since the ECC strength
> > is only 4 bits/512 bytes and 8 bits have been flipped.
> >
> >> One thing that confuses me is that if I repeatedly call nanddump
> >> I continue to get the "ECC: 4 corrected bitflips" message
> >> and the "ECC corrected" count increases by 4 each time. If
> >> these bits are being corrected, which is apparent from
> >> the output of nanddump, shouldn't subsequent calls
> >> indicate that no bitflips needed to be corrected, since they
> >> were corrected previously?
> > Nope, they're corrected on the fly and only in RAM, so each time you
> > read the page, you'll have to fix the bitflips until you erase and
> > rewrite the faulty block.
> >
> >
> 
> Hi Boris,
> 
> Here is the entire output that should answer your questions.
> 
> In the log I am running the following commands:
> flash_erase /dev/mtd4 0 0
> ./nanddump -f /tmp/dump -s 0 -l 2048 /dev/mtd4
> hexdump /tmp/dump
> ./nandflipbits /dev/mtd4 1@0:5@0:7@30:3@46:5@47
> ./nanddump -f /tmp/dump -s 0 -l 2048 /dev/mtd4
> hexdump /tmp/dump
> 
> Output on mainline kernel without bitflip correction patches:
> http://pastebin.com/MgBVxALR
> 
> Output on mainline kernel with bitflip correction patches:
> http://pastebin.com/NdKv0NhV
> 
> For some reason I'm only getting 1 bit corrected when
> using the bitflip correction patches. Comparing my earlier logs
> with these, the only difference I'm seeing is that "ECC
> failed" is increasing but "ECC corrected" isn't changing.
> 

That's what I was expecting: your ECC engine can only fix
4 bits per 512 bytes, which is why the erased-page bitflip correction
fails when more than 4 bits are flipped in a given 512-byte block.
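
The rule can be sketched in a few lines of Python. This is only an
illustration of the idea, not the kernel code: the function names are
made up for the example, the 512-byte step and 4-bit strength are the
values from this thread, and the real check also has to look at the
ECC/OOB bytes, which the sketch ignores.

```python
ECC_STEP_SIZE = 512  # bytes covered by one ECC step (value from this thread)
ECC_STRENGTH = 4     # correctable bits per step (ecc4bit mode)


def flip_bits(page, flips):
    """Flip (bit, byte_offset) pairs in place, mimicking nandflipbits bit@offset."""
    for bit, offset in flips:
        page[offset] ^= 1 << bit


def check_erased_step(step, strength=ECC_STRENGTH):
    """Return the number of zero bits if the step still counts as erased, else -1."""
    zero_bits = sum(8 - bin(byte).count("1") for byte in step)
    return zero_bits if zero_bits <= strength else -1


def read_erased_page(page):
    """Return (corrected, failed) for a page that is expected to be erased.

    Steps within the ECC strength are reset to 0xff in the buffer only,
    which models a correction done in RAM and not on the flash itself.
    """
    corrected = 0
    failed = 0
    for start in range(0, len(page), ECC_STEP_SIZE):
        step = page[start:start + ECC_STEP_SIZE]
        flips = check_erased_step(step)
        if flips < 0:
            failed += 1  # too many bitflips: uncorrectable step
        else:
            corrected += flips
            page[start:start + ECC_STEP_SIZE] = b"\xff" * len(step)
    return corrected, failed


if __name__ == "__main__":
    page = bytearray(b"\xff" * 2048)
    flip_bits(page, [(1, 0), (5, 0), (7, 30), (3, 46), (5, 47)])
    print(read_erased_page(page))
```

With the five flips above, all of them land in the first 512-byte step,
so that step is reported as failed and nothing is corrected.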

Now try to flip only 4 bits instead of 5:

./nandflipbits /dev/mtd4 1@0:5@0:7@30:3@46
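
Before running a command like this, you can check how many flips land in
each ECC step. The helper below is a rough sketch, not part of
mtd-utils: it assumes the colon-separated bit@offset form used in this
thread, decimal byte offsets, bit numbers 0-7 within a byte, a 512-byte
ECC step, and that each bit@offset pair is listed only once. The
function names are invented for the example.

```python
from collections import Counter

ECC_STEP_SIZE = 512  # assumption: bytes per ECC step, as stated in this thread


def parse_flips(spec):
    """Parse 'bit@offset:bit@offset...' into a list of (bit, offset) tuples."""
    flips = []
    for item in spec.split(":"):
        bit_text, offset_text = item.split("@")
        bit, offset = int(bit_text), int(offset_text)
        if not 0 <= bit <= 7:
            # The bit field is an offset within one byte, so 8 and above is invalid.
            raise ValueError("invalid bit description: %s" % item)
        flips.append((bit, offset))
    return flips


def flips_per_step(spec, step_size=ECC_STEP_SIZE):
    """Count how many requested flips fall into each ECC step index."""
    return Counter(offset // step_size for _bit, offset in parse_flips(spec))


if __name__ == "__main__":
    for spec in ("1@0:5@0:7@30:3@46:5@47", "1@0:5@0:7@30:3@46"):
        print(spec, dict(flips_per_step(spec)))
```

A step whose count is above the ECC strength (4 here) should end up as
an uncorrectable error; a count at or below it should be corrected.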



-- 
Boris Brezillon, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com


