Testing generic empty page bit flips recovery
Franklin S Cooper Jr.
fcooper at ti.com
Wed Dec 30 10:07:44 PST 2015
On 12/30/2015 11:53 AM, Boris Brezillon wrote:
> On Wed, 30 Dec 2015 11:45:38 -0600
> "Franklin S Cooper Jr." <fcooper at ti.com> wrote:
>
>>
>> On 12/30/2015 10:59 AM, Boris Brezillon wrote:
>>> On Wed, 30 Dec 2015 10:40:49 -0600
>>> "Franklin S Cooper Jr." <fcooper at ti.com> wrote:
>>>
>>>> On 12/30/2015 10:02 AM, Boris Brezillon wrote:
>>>>> On Wed, 30 Dec 2015 09:33:52 -0600
>>>>> "Franklin S Cooper Jr." <fcooper at ti.com> wrote:
>>>>>
>>>>>> On 12/30/2015 08:40 AM, Boris Brezillon wrote:
>>>>>>> Hi Franklin,
>>>>>>>
>>>>>>> On Wed, 30 Dec 2015 08:10:20 -0600
>>>>>>> "Franklin S Cooper Jr." <fcooper at ti.com> wrote:
>>>>>>>
>>>>>>>> I am trying to follow up on this discussion from this patch
>>>>>>>> set (https://patchwork.ozlabs.org/patch/539059/) which
>>>>>>>> suggested that Michael instead test the generic bitflips
>>>>>>>> recovery that is implemented by Boris's "mtd: nand: properly
>>>>>>>> handle bitflips in erased pages" patchset
>>>>>>>> (http://lists.infradead.org/pipermail/linux-mtd/2015-September/061617.html).
>>>>>>>> I would like to test Boris's patchset but first I need to
>>>>>>>> recreate the error that his patch is fixing.
>>>>>>>>
>>>>>>>> The error that the patchset is attempting to fix isn't
>>>>>>>> something I have ever encountered before. Currently I am
>>>>>>>> trying to reproduce this issue on a TI K2E evm that uses the
>>>>>>>> davinci nand driver. I flashed the nand's file-system
>>>>>>>> partition with a ubi filesystem and the board is currently
>>>>>>>> set to boot using the file-system on the nand. After about
>>>>>>>> 60 secs I cut the power from the board and boot the board
>>>>>>>> again. What I would expect is that the board will eventually
>>>>>>>> fail to mount the ubi filesystem but currently the board has
>>>>>>>> run for over 24 hours and powered on and off over 1400 times
>>>>>>>> and it's still mounting the file-system perfectly fine.
>>>>>>>>
>>>>>>>> Any suggestions on a test case that I can use to force the
>>>>>>>> empty page bit flips error?
>>>>>>>>
>>>>>>>>
>>>>>>> The davinci driver seems to support raw accesses, so you can try to
>>>>>>> apply this patch [1] against the mtd-utils tree (not sure it still
>>>>>>> applies cleanly, but it should work with mtd-utils-1.5.1), and use the
>>>>>>> nandflipbits tool:
>>>>>>>
>>>>>>> # flash_erase /dev/mtdX <offset> 1
>>>>>>> # nandflipbits /dev/mtdX 1@<offset>
>>>>>>> # nanddump -f /tmp/dump -s <offset> -l <page-size> /dev/mtdX
>>>>>>>
>>>>>>> Without the patch, nanddump should complain about uncorrectable errors,
>>>>>>> and if you hexdump /tmp/dump you should see the bitflip.
>>>>>>> If nanddump does not complain after applying my patch, then it means it
>>>>>>> fixes the "bitflips in erased pages" bug.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>>
>>>>>>> Boris
>>>>>>>
>>>>>>> [1]http://lists.infradead.org/pipermail/linux-mtd/2014-November/056634.html
>>>>>> Hi Boris,
>>>>>>
>>>>>> Thanks for the quick reply. I built mtd-utils with your
>>>>>> patch and ran the suggested commands on a 4.1 based kernel
>>>>>> without your kernel patchset and I didn't see your expected
>>>>>> output. The 4.1 based kernel hasn't had any changes to
>>>>>> davinci_nand or nand subsystem that would address this
>>>>>> bitflip error.
>>>>>>
>>>>>> I'm currently going to attempt to run the same test on the
>>>>>> latest mainline.
>>>>>>
>>>>>> Here is the output I received when I ran your suggested
>>>>>> commands on the 4.1 based kernel.
>>>>>>
>>>>>> root@k2e-evm:~# ./flash_erase /dev/mtd4 4096 1
>>>>>> Erasing 128 Kibyte @ 0 -- 100 % complete
>>>>>> root@k2e-evm:~# ./nandflipbits /dev/mtd4 1@4096
>>>>>> root@k2e-evm:~# ./nanddump -f /tmp/dump -s 4096 -l 2048 /dev/mtd4
>>>>>> ECC failed: 0
>>>>>> ECC corrected: 0
>>>>>> Number of bad blocks: 0
>>>>>> Number of bbt blocks: 4
>>>>>> Block size 131072, page size 2048, OOB size 64
>>>>>> root@k2e-evm:~# hexdump /tmp/dump
>>>>>> 0000000 fffd ffff ffff ffff ffff ffff ffff ffff
>>>>>> 0000010 ffff ffff ffff ffff ffff ffff ffff ffff
>>>>>> *
>>>>>> 0000800
>>>>>>
>>>>>> Any thoughts on why I'm not seeing the expected error?
>>>>>>
>>>>> Oh, actually this behavior is explained in the commit message:
>>>>>
>>>>> "Currently empty page bit flips are not corrected and report 0 errors."
>>>>>
>>>>> Which explains why you're seeing the bitflip in the dump, but nothing
>>>>> reported by the MTD layer.
>>>>>
>>>>> After applying my patch, the bitflip should simply disappear. You can
>>>>> then try to generate more bitflips than the engine can actually fix
>>>>> (nandflipbits /dev/mtd4 1@0:5@0:49@0:98@0:132@0) and check that MTD
>>>>> reports an uncorrectable error.
>>>> I verified that I am indeed using ecc4bit mode.
>>>>
>>>> I attempted to run the series of nandflipbits commands as you
>>>> suggested but I get an "Invalid bit description" error from the
>>>> utility. For some reason I can only use the nandflipbits
>>>> utility for bits 1-7. Anything higher and I get the "Invalid
>>>> bit description" error.
>>> Indeed. I developed that tool a long time ago and didn't remember that
>>> the bit field is encoding the bit offset within a byte. This command
>>> should work.
>>>
>>> nandflipbits /dev/mtd4 1@0:5@0:7@30:3@46:5@47
>>>
>>>> On the latest master commit I ran nandflipbits for bits 1-7
>>>> at address 0. However, I still didn't receive any error from
>>>> nanddump although I do see the flipped bits in the hexdump
>>>> /tmp/dump output.
>>> How many of them do you see?
>>>
>>>> I then applied your patchset on top of the latest mainline
>>>> and ran nandflipbits for bits 1-7 at address 0.
>>>> I get the below output which seems to be correct.
>>>>
>>>> root@k2e-evm:~# ./nandflipbits /dev/mtd4 1@0
>>>> root@k2e-evm:~# ./nandflipbits /dev/mtd4 2@0
>>>> root@k2e-evm:~# ./nandflipbits /dev/mtd4 3@0
>>>> root@k2e-evm:~# ./nandflipbits /dev/mtd4 4@0
>>>> root@k2e-evm:~# ./nandflipbits /dev/mtd4 5@0
>>>> root@k2e-evm:~# ./nandflipbits /dev/mtd4 6@0
>>>> root@k2e-evm:~# ./nandflipbits /dev/mtd4 7@0
>>>> root@k2e-evm:~# ./nanddump -f /tmp/dump -s 0 -l 2048 /dev/mtd4
>>>>
>>>> ECC failed: 1
>>>> ECC corrected: 18
>>>> Number of bad blocks: 0
>>>> Number of bbt blocks: 4
>>>> Block size 131072, page size 2048, OOB size 64
>>>> Dumping data starting at 0x00000000 and ending at 0x00000800...
>>>> ECC: 4 corrected bitflip(s) at offset 0x00000000
>>>> root@k2e-evm:~# hexdump /tmp/dump
>>>> 0000000 ffff ffff ffff ffff ffff ffff ffff ffff
>>>> *
>>>> 0000800
>>> Hm, that's weird. You should get an ECC failure since the ECC strength
>>> is only 4bits/512byte and 8 bits have been flipped.
>>>
>>>> One thing that confuses me is if I repeatedly call nanddump
>>>> I continue to get the "ECC: 4 corrected bitflips" message
>>>> and the "ECC corrected" count increases by 4 each time. If
>>>> these bits are being corrected, which is apparent from
>>>> looking at the output of nanddump, shouldn't subsequent calls
>>>> indicate that no bitflips needed to be corrected since they
>>>> were corrected previously?
>>> Nope, they're corrected on the fly and only in RAM, so each time you
>>> read the page, you'll have to fix the bitflips until you erase and
>>> rewrite the faulty block.
>>>
>>>
>> Hi Boris,
>>
>> Here is the entire output that should answer your questions.
>>
>> In the log I am running the following commands:
>> flash_erase /dev/mtd4 0 0
>> ./nanddump -f /tmp/dump -s 0 -l 2048 /dev/mtd4
>> hexdump /tmp/dump
>> ./nandflipbits /dev/mtd4 1@0:5@0:7@30:3@46:5@47
>> ./nanddump -f /tmp/dump -s 0 -l 2048 /dev/mtd4
>> hexdump /tmp/dump
>>
>> Output on mainline kernel without bitflip correction patches:
>> http://pastebin.com/MgBVxALR
>>
>> Output on mainline kernel with bitflip correction patches:
>> http://pastebin.com/NdKv0NhV
>>
>> For some reason I'm only getting 1 bit being corrected when
>> using the bitflip correction patches. Comparing my logs from
>> before to now the only difference I'm seeing is that ECC
>> failed is increasing but ECC corrected isn't changing.
>>
> That's what I was expecting: your ECC engine is only fixing
> 4bits/512byte, which is why the bitflip in erased page correction fails
> when you have more than 4 bits flipped in a given 512byte block.
>
> Now try to flip only 4 bits instead of 5:
>
> ./nandflipbits /dev/mtd4 1@0:5@0:7@30:3@46
Here is the output:
root@k2e-evm:~/# ./flash_erase /dev/mtd4 0 1
root@k2e-evm:~/# ./nanddump -f /tmp/dump -s 0 -l 2048 /dev/mtd4
ECC failed: 5
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000800...
root@k2e-evm:~/# hexdump /tmp/dump
0000000 ffff ffff ffff ffff ffff ffff ffff ffff
*
0000800
root@k2e-evm:~/# ./nandflipbits /dev/mtd4 1@0:5@0:7@30:3@46
root@k2e-evm:~/# ./nanddump -f /tmp/dump -s 0 -l 2048 /dev/mtd4
ECC failed: 5
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 4
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000800...
ECC: 4 corrected bitflip(s) at offset 0x00000000
root@k2e-evm:~/# hexdump /tmp/dump
0000000 ffff ffff ffff ffff ffff ffff ffff ffff
*
0000800
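
Instead of reading the hexdump by eye, the dump can be checked with a
small script. The following is only a rough sketch: it counts the zero
bits in each 512-byte ECC step of a dumped page, assuming the page was
erased (all 0xff) before the flips were injected. The function name
zero_bits_per_step is made up for this example, and the 512-byte step
size is taken from the 4bits/512byte figure quoted above.

```python
def zero_bits_per_step(data: bytes, step: int = 512) -> list:
    """Count zero bits in each `step`-byte slice of a dumped page.

    On an erased page every bit should read 1, so each zero bit is
    one bitflip. The default step of 512 is an assumption taken from
    the ECC geometry mentioned in this thread.
    """
    counts = []
    for start in range(0, len(data), step):
        chunk = data[start:start + step]
        # 8 bits per byte minus the number of set bits = zero bits
        counts.append(sum(8 - bin(byte).count("1") for byte in chunk))
    return counts
```

Calling it on the contents of /tmp/dump (for example
zero_bits_per_step(open("/tmp/dump", "rb").read())) should give one
count per ECC step. For the dumps above, which are all 0xff, every
count would be 0. For the earlier dump on the 4.1 kernel, where hexdump
showed "fffd" as the first 16-bit word, the first step would report one
flipped bit.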
Running nanddump again shows that 4 bits were corrected.
So it seems like things are working as expected.
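
To summarize my understanding of the behavior seen in these tests,
here is a simplified model of the decision being made for an erased
chunk. It is only a sketch and not the kernel code: the names
check_erased_chunk and flip are invented for this example, only the
data bytes are considered (no OOB/ECC bytes), and returning -1 for the
uncorrectable case is my own convention.

```python
ECC_STEP = 512      # bytes per ECC step (from this thread)
ECC_STRENGTH = 4    # correctable bits per step (ecc4bit mode)


def check_erased_chunk(chunk: bytes, strength: int = ECC_STRENGTH):
    """Return (flips, data) for a chunk that is expected to be erased.

    flips is the number of zero bits found, or -1 when there are more
    than `strength` of them (treated as uncorrectable in this model).
    When the chunk is accepted as erased, data is reset to all 0xff.
    """
    flips = sum(8 - bin(byte).count("1") for byte in chunk)
    if flips > strength:
        return -1, bytes(chunk)
    return flips, b"\xff" * len(chunk)


def flip(chunk: bytearray, bit: int, offset: int) -> None:
    """Clear one bit, mimicking a 1 -> 0 flip in an erased page."""
    chunk[offset] &= ~(1 << bit) & 0xFF
```

With the four flips from the last test (1@0, 5@0, 7@30, 3@46) this
model reports 4 corrected bits and returns an all-0xff chunk. Adding
the fifth flip (5@47) pushes the count over the ECC strength and the
chunk is reported as uncorrectable, which matches what I saw.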

It seems like patches 2-5 from your patchset weren't pulled
in because you and Brian wanted more testing on other
platforms. If you're going to submit a rev 4, please feel free
to CC me so I can test the patches and add a Tested-by.
If not, feel free to add a Tested-by for your current rev 3
patchset, or if you can bounce those emails my way I can add
it myself. Whichever approach you prefer.

Thank you for your help and let me know if there is any
further test you would like me to run.
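
For anyone repeating this test, the nandflipbits arguments used in this
thread have the form <bit>@<byte offset>, joined with ':'. Below is a
rough sketch for checking a descriptor string before running it on
hardware. It is not the tool's own parser: parse_flips and
flips_per_step are invented names, the accepted bit range of 0-7 is an
assumption (I only tried 1-7), and offsets are parsed as plain decimal.

```python
from collections import Counter


def parse_flips(spec: str) -> list:
    """Parse 'bit@offset[:bit@offset...]' into (bit, byte_offset) pairs."""
    flips = []
    for item in spec.split(":"):
        bit_text, sep, offset_text = item.partition("@")
        if not sep:
            raise ValueError(f"invalid bit description: {item!r}")
        bit, offset = int(bit_text), int(offset_text)
        # The bit field is an offset within one byte, so 0-7 (assumed).
        if not 0 <= bit <= 7:
            raise ValueError(f"invalid bit description: {item!r}")
        flips.append((bit, offset))
    return flips


def flips_per_step(flips, step: int = 512) -> Counter:
    """Count how many requested flips land in each ECC step."""
    return Counter(offset // step for _bit, offset in flips)
```

For example, "1@0:5@0:7@30:3@46:5@47" puts five flips in step 0, one
more than a 4bits/512byte engine can fix, while "49@0" is rejected
because 49 is not a valid bit position within a byte.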