Testing generic empty page bit flips recovery

Boris Brezillon boris.brezillon at free-electrons.com
Wed Dec 30 11:43:54 PST 2015


On Wed, 30 Dec 2015 12:07:44 -0600
"Franklin S Cooper Jr." <fcooper at ti.com> wrote:

> 
> 
> On 12/30/2015 11:53 AM, Boris Brezillon wrote:
> > On Wed, 30 Dec 2015 11:45:38 -0600
> > "Franklin S Cooper Jr." <fcooper at ti.com> wrote:
> >
> >>
> >> On 12/30/2015 10:59 AM, Boris Brezillon wrote:
> >>> On Wed, 30 Dec 2015 10:40:49 -0600
> >>> "Franklin S Cooper Jr." <fcooper at ti.com> wrote:
> >>>
> >>>> On 12/30/2015 10:02 AM, Boris Brezillon wrote:
> >>>>> On Wed, 30 Dec 2015 09:33:52 -0600
> >>>>> "Franklin S Cooper Jr." <fcooper at ti.com> wrote:
> >>>>>
> >>>>>> On 12/30/2015 08:40 AM, Boris Brezillon wrote:
> >>>>>>> Hi Franklin,
> >>>>>>>
> >>>>>>> On Wed, 30 Dec 2015 08:10:20 -0600
> >>>>>>> "Franklin S Cooper Jr." <fcooper at ti.com> wrote:
> >>>>>>>
> >>>>>>>> I am trying to follow up on this discussion from this patch
> >>>>>>>> set (https://patchwork.ozlabs.org/patch/539059/) which
> >>>>>>>> suggested that Michael instead test the generic bitflips
> >>>>>>>> recovery that is implemented by Boris's "mtd: nand: properly
> >>>>>>>> handle bitflips in erased pages" patchset
> >>>>>>>> (http://lists.infradead.org/pipermail/linux-mtd/2015-September/061617.html).
> >>>>>>>> I would like to test Boris's patchset but first I need to
> >>>>>>>> recreate the error that his patch is fixing.
> >>>>>>>>
> >>>>>>>> The error that the patchset is attempting to fix isn't
> >>>>>>>> something I have ever encountered before. Currently I am
> >>>>>>>> trying to reproduce this issue on a TI K2E evm that uses the
> >>>>>>>> davinci nand driver. I flashed the nand's file-system
> >>>>>>>> partition with a ubi filesystem and the board is currently
> >>>>>>>> set to boot using the file-system on the nand. After about
> >>>>>>>> 60 secs I cut the power from the board and boot the board
> >>>>>>>> again. What I would expect is that the board will eventually
> >>>>>>>> fail to mount the ubi filesystem but currently the board has
> >>>>>>>> run for over 24 hours and been powered on and off over 1400 times
> >>>>>>>> and it's still mounting the file-system perfectly fine.
> >>>>>>>>
> >>>>>>>> Any suggestions on a test case that I can use to force the
> >>>>>>>> empty page bit flips error?
> >>>>>>>>
> >>>>>>>>
> >>>>>>> The davinci driver seems to support raw accesses, so you can try to
> >>>>>>> apply this patch [1] against the mtd-utils tree (not sure it still
> >>>>>>> applies cleanly, but it should work with mtd-utils-1.5.1), and use the
> >>>>>>> nandflipbits tool:
> >>>>>>>
> >>>>>>> # flash_erase /dev/mtdX <offset> 1
> >>>>>>> # nandflipbits /dev/mtdX 1@<offset>
> >>>>>>> # nanddump -f /tmp/dump -s <offset> -l <page-size> /dev/mtdX
> >>>>>>>
> >>>>>>> Without the patch, nanddump should complain about uncorrectable errors,
> >>>>>>> and if you hexdump /tmp/dump you should see the bitflip.
> >>>>>>> If nanddump does not complain after applying my patch, then it means it
> >>>>>>> fixes the "bitflips in erased pages" bug.
> >>>>>>>
> >>>>>>> Best Regards,
> >>>>>>>
> >>>>>>> Boris
> >>>>>>>
> >>>>>>> [1]http://lists.infradead.org/pipermail/linux-mtd/2014-November/056634.html
> >>>>>> Hi Boris,
> >>>>>>
> >>>>>> Thanks for the quick reply. I built mtd-utils with your
> >>>>>> patch and ran the suggested commands on a 4.1 based kernel
> >>>>>> without your kernel patchset and I didn't see your expected
> >>>>>> output. The 4.1 based kernel hasn't had any changes to
> >>>>>> davinci_nand or nand subsystem that would address this
> >>>>>> bitflip error.
> >>>>>>
> >>>>>> I'm currently going to attempt to run the same test on the
> >>>>>> latest mainline.
> >>>>>>
> >>>>>> Here is the output I received when I ran your suggested
> >>>>>> commands on the 4.1 based kernel.
> >>>>>> root@k2e-evm:~# ./flash_erase /dev/mtd4 4096 1
> >>>>>> Erasing 128 Kibyte @ 0 -- 100 % complete
> >>>>>> root@k2e-evm:~# ./nandflipbits /dev/mtd4 1@4096
> >>>>>> root@k2e-evm:~# ./nanddump -f /tmp/dump -s 4096 -l 2048 /dev/mtd4
> >>>>>> ECC failed: 0
> >>>>>> ECC corrected: 0
> >>>>>> Number of bad blocks: 0
> >>>>>> Number of bbt blocks: 4
> >>>>>> Block size 131072, page size 2048, OOB size 64
> >>>>>> root@k2e-evm:~# hexdump /tmp/dump
> >>>>>> 0000000 fffd ffff ffff ffff ffff ffff ffff ffff
> >>>>>> 0000010 ffff ffff ffff ffff ffff ffff ffff ffff
> >>>>>> *
> >>>>>> 0000800
> >>>>>>
> >>>>>> Any thoughts on why I'm not seeing the expected error?
> >>>>>>
> >>>>> Oh, actually this behavior is explained in the commit message:
> >>>>>
> >>>>> "Currently empty page bit flips are not corrected and report 0 errors."
> >>>>>
> >>>>> Which explains why you're seeing the bitflip in the dump, but nothing
> >>>>> reported by the MTD layer.
> >>>>>
> >>>>> After applying my patch, the bitflip should simply disappear. You can
> >>>>> then try to generate more bitflips than the engine can actually fix
> >>>>> (nandflipbits /dev/mtd4 1@0:5@0:49@0:98@0:132@0) and check that MTD
> >>>>> reports an uncorrectable error.
> >>>> I verified that I am indeed using ecc4bit mode.
> >>>>
> >>>> I attempted to run the series of nandflipbits flips as you
> >>>> suggested but I get an "Invalid bit description" error from the
> >>>> utility. For some reason I can only use the nandflipbits
> >>>> utility for bits 1-7. Anything higher and I get the "Invalid
> >>>> bit description" error.
> >>> Indeed. I developed that tool a long time ago and didn't remember that
> >>> the bit field is encoding the bit offset within a byte. This command
> >>> should work.
> >>>
> >>> nandflipbits /dev/mtd4 1@0:5@0:7@30:3@46:5@47
> >>>
> >>>> On the latest master commit I ran nandflipbits for bits 1-7
> >>>> at address 0. However, I still didn't receive any error from
> >>>> nanddump although I do see the flipped bits in the hexdump
> >>>> /tmp/dump output.
> >>> How many of them do you see?
> >>>
> >>>> I then applied your patchset on top of the latest mainline
> >>>> and ran nandflipbits for bits 1-7 at address 0.
> >>>> I get the below output which seems to be correct.
> >>>>
> >>>> root@k2e-evm:~# ./nandflipbits /dev/mtd4 1@0
> >>>> root@k2e-evm:~# ./nandflipbits /dev/mtd4 2@0
> >>>> root@k2e-evm:~# ./nandflipbits /dev/mtd4 3@0
> >>>> root@k2e-evm:~# ./nandflipbits /dev/mtd4 4@0
> >>>> root@k2e-evm:~# ./nandflipbits /dev/mtd4 5@0
> >>>> root@k2e-evm:~# ./nandflipbits /dev/mtd4 6@0
> >>>> root@k2e-evm:~# ./nandflipbits /dev/mtd4 7@0
> >>>> root@k2e-evm:~# ./nanddump -f /tmp/dump -s 0 -l 2048 /dev/mtd4
> >>>>
> >>>> ECC failed: 1
> >>>> ECC corrected: 18
> >>>> Number of bad blocks: 0
> >>>> Number of bbt blocks: 4
> >>>> Block size 131072, page size 2048, OOB size 64
> >>>> Dumping data starting at 0x00000000 and ending at 0x00000800...
> >>>> ECC: 4 corrected bitflip(s) at offset 0x00000000
> >>>> root@k2e-evm:~# hexdump /tmp/dump
> >>>> 0000000 ffff ffff ffff ffff ffff ffff ffff ffff
> >>>> *
> >>>> 0000800
> >>> Hm, that's weird. You should get an ECC failure since the ECC strength
> >>> is only 4bits/512byte and 8 bits have been flipped.
> >>>
> >>>> One thing that confuses me is that if I repeatedly call nanddump
> >>>> I continue to get the "ECC: 4 corrected bitflips" message
> >>>> and the "ECC corrected" count increases by 4 each time. If
> >>>> these bits are being corrected, which is apparent from
> >>>> looking at the output of nanddump, shouldn't subsequent calls
> >>>> indicate that no bitflips needed to be corrected, since they
> >>>> were corrected previously?
> >>> Nope, they're corrected on the fly and only in RAM, so each time you
> >>> read the page, you'll have to fix the bitflips until you erase and
> >>> rewrite the faulty block.
> >>>
> >>>
> >> Hi Boris,
> >>
> >> Here is the entire output that should answer your questions.
> >>
> >> In the log I am running the following commands:
> >> flash_erase /dev/mtd4 0 0
> >> ./nanddump -f /tmp/dump -s 0 -l 2048 /dev/mtd4
> >> hexdump /tmp/dump
> >> ./nandflipbits /dev/mtd4 1@0:5@0:7@30:3@46:5@47
> >> ./nanddump -f /tmp/dump -s 0 -l 2048 /dev/mtd4
> >> hexdump /tmp/dump
> >>
> >> Output on mainline kernel without bitflip correction patches:
> >> http://pastebin.com/MgBVxALR
> >>
> >> Output on mainline kernel with bitflip correction patches:
> >> http://pastebin.com/NdKv0NhV
> >>
> >> For some reason I'm only getting 1 bit being corrected when
> >> using the bitflip correction patches. Comparing my logs from
> >> before to now the only difference I'm seeing is that ECC
> >> failed is increasing but ECC corrected isn't changing.
> >>
> > That's what I was expecting: your ECC engine is only fixing
> > 4bits/512byte, which is why the bitflip in erased page correction fails
> > when you have more than 4 bits flipped in a given 512byte block.
> >
> > Now try to flip only 4 bits instead of 5:
> >
> > ./nandflipbits /dev/mtd4 1@0:5@0:7@30:3@46
> 
> Here is the output:
> root@k2e-evm:~/# ./flash_erase /dev/mtd4 0 1
> root@k2e-evm:~/# ./nanddump -f /tmp/dump -s 0 -l 2048 /dev/mtd4
> ECC failed: 5
> ECC corrected: 0
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000800...
> root@k2e-evm:~/# hexdump /tmp/dump
> 0000000 ffff ffff ffff ffff ffff ffff ffff ffff
> *
> 0000800
> root@k2e-evm:~/# ./nandflipbits /dev/mtd4 1@0:5@0:7@30:3@46
> root@k2e-evm:~/# ./nanddump -f /tmp/dump -s 0 -l 2048 /dev/mtd4
> ECC failed: 5
> ECC corrected: 0
> Number of bad blocks: 0
> Number of bbt blocks: 4
> Block size 131072, page size 2048, OOB size 64
> Dumping data starting at 0x00000000 and ending at 0x00000800...
> ECC: 4 corrected bitflip(s) at offset 0x00000000
> root@k2e-evm:~/# hexdump /tmp/dump
> 0000000 ffff ffff ffff ffff ffff ffff ffff ffff
> *
> 0000800
> 
> Running nanddump again shows that 4 bits were corrected.
> So it seems like things are working as expected.
> 
> It seems like patches 2-5 from your patchset weren't pulled
> in because you and Brian wanted more testing on other
> platforms. If you're going to submit a rev 4 please feel free
> to CC me so I can test the patches out and add a Tested-by.

Just sent a v4. Feel free to test it and add your
Tested-by/Acked-by/Reviewed-by.
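
For readers following the thread, the 4 bits per 512 bytes limit discussed
above can be illustrated with a small host-side model. This is only a sketch,
not the kernel code from the patchset: the function names are hypothetical,
the accepted bit range of 0-7 is an assumption based on the "bit offset within
a byte" remark, and the model looks at page data only (it ignores the ECC and
OOB bytes that a real check would presumably also count).

```python
ECC_STEP = 512      # bytes covered by one ECC step (figure quoted in the thread)
ECC_STRENGTH = 4    # correctable bitflips per step (figure quoted in the thread)
PAGE_SIZE = 2048    # page size reported by nanddump in the thread


def apply_flips(page, flips):
    """Flip (bit, byte_offset) pairs in place; bit is assumed to be 0-7."""
    for bit, offset in flips:
        if not 0 <= bit <= 7:
            raise ValueError("invalid bit description: %d" % bit)
        page[offset] ^= 1 << bit
    return page


def check_erased_page(page, step=ECC_STEP, strength=ECC_STRENGTH):
    """Count zero bits per ECC step of a supposedly erased (all 0xff) page.

    Returns the list of per-step counts, or None when any step holds more
    flipped bits than the assumed ECC strength (modelled as uncorrectable).
    """
    counts = []
    for start in range(0, len(page), step):
        chunk = page[start:start + step]
        zeros = sum(8 - bin(byte).count("1") for byte in chunk)
        if zeros > strength:
            return None
        counts.append(zeros)
    return counts


if __name__ == "__main__":
    page = bytearray(b"\xff" * PAGE_SIZE)
    # Same four flips as the final test in the thread.
    apply_flips(page, [(1, 0), (5, 0), (7, 30), (3, 46)])
    print(check_erased_page(page))
    # A fifth flip in the same 512-byte step exceeds the assumed strength.
    apply_flips(page, [(5, 47)])
    print(check_erased_page(page))
```

In this model a single flip of bit 1 in byte 0 turns 0xff into 0xfd, which is
consistent with the "fffd" word in the first hexdump output quoted above.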


-- 
Boris Brezillon, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com
