[RFC/PATCH 0/5 v2] mtd:ubi: Read disturb and Data retention handling
Richard Weinberger
richard at nod.at
Mon Nov 10 04:35:26 PST 2014
On 10.11.2014 at 13:07, Juergen Borleis wrote:
> Hi Richard,
>
> sorry to jump in so late:
>
> Richard Weinberger wrote:
>>> If we ignore read-disturb and don't scrub heavily read blocks we will
>>> have data loss as well. The only difference between the two scenarios is
>>> "how long before it happens". Read-disturb wasn't an issue since the average
>>> lifespan of a NAND device was ~5 years. Read-disturb occurs over a longer
>>> lifespan. That's why it's required now: a need for a "long life NAND".
>>
>> Okay, read-disturb will only happen if you read blocks *very* often. Do you
>> have numbers, datasheets, etc...?
>
> I ran a simple test, reading the first 2048 pages of my NAND in an
> endless loop. Only reading, nothing else (done while the bootloader was
> running, nothing else touches the NAND memory).
>
> Below is a result of my test with a 512 MiB SLC NAND with 2 KiB page size and
> 128 KiB block size:
>
> The used NAND controller is able to correct up to 8 flipped bits. After the
> 9th bit is flipped the read returns -74.
>
> This log is a snapshot after the whole area of the first 4 MiB of the NAND
> was read 201688 times.
>
> Columns, left to right:
>   page no: iteration # of the 1st, 2nd, 3rd, 4th, 5th, 6th, 7th, 8th bitflip,
>   then err: <errcode> @ iteration #
> [...]
> 529: 91760 - - - - - - - err: 0 @ 0
> 530: 67168 - - - - - - - err: 0 @ 0
> 531: 141039 - - - - - - - err: 0 @ 0
> 532: 100288 - - - - - - - err: 0 @ 0
> 533: 133754 - - - - - - - err: 0 @ 0
> 534: 130095 - - - - - - - err: 0 @ 0
> 535: - - - - - - - - err: 0 @ 0
> 536: - - - - - - - - err: 0 @ 0
> 537: - - - - - - - - err: 0 @ 0
> 538: 116134 - - - - - - - err: 0 @ 0
> 539: 198269 - - - - - - - err: 0 @ 0
> 540: 61589 - - - - - - - err: 0 @ 0
> 541: 69437 126618 - - - - - - err: 0 @ 0
> 542: 127839 146936 - - - - - - err: 0 @ 0
> 543: 90092 112675 - - - - - - err: 0 @ 0
> 544: 110714 - - - - - - - err: 0 @ 0
> 545: 102323 179716 - - - - - - err: 0 @ 0
> 546: 63838 107524 - - - - - - err: 0 @ 0
> 547: 140739 - - - - - - - err: 0 @ 0
> 548: 129423 - - - - - - - err: 0 @ 0
> 549: 79855 172562 189242 - - - - - err: 0 @ 0
> 550: 59809 95758 - - - - - - err: 0 @ 0
> 551: 61590 102645 182467 199394 - - - - err: 0 @ 0
> 552: 34892 47024 169765 - - - - - err: 0 @ 0
> 553: 26725 99616 168528 - - - - - err: 0 @ 0
> 554: 23348 117529 160522 194367 - - - - err: 0 @ 0
> 555: 108062 175917 - - - - - - err: 0 @ 0
> 556: 49259 120590 188435 - - - - - err: 0 @ 0
> 557: 54306 96666 120881 - - - - - err: 0 @ 0
> 558: 29085 31802 42191 43422 108748 167569 - - err: 0 @ 0
> 559: 56507 93286 - - - - - - err: 0 @ 0
> 560: 81849 101134 143402 152513 - - - - err: 0 @ 0
> 561: 13890 135991 199507 - - - - - err: 0 @ 0
> 562: 34135 69826 90917 107625 147321 161796 194928 199981 err: 0 @ 0
> 563: 36564 83188 89780 110756 113977 132219 171701 181298 err: -74 @ 196719
> 564: 24710 84965 131464 136672 143401 166123 196109 - err: 0 @ 0
> 565: 63052 190669 200874 - - - - - err: 0 @ 0
> 566: 23602 62334 107324 108235 111701 141831 143176 170709 err: 0 @ 0
> 567: 7827 81759 105200 146536 175196 181900 192630 200021 err: 0 @ 0
> 568: 19248 38095 42491 85788 108021 150404 178145 - err: 0 @ 0
> 569: 77853 93441 116798 149955 175747 - - - err: 0 @ 0
> 570: 23229 34546 60418 84112 169202 191880 198953 - err: 0 @ 0
> 571: 53596 66769 106074 133504 134134 163610 169159 178226 err: -74 @ 180360
> 572: 74009 83572 89710 103833 116947 147067 167137 - err: 0 @ 0
> 573: 23161 43896 89573 95705 102324 102887 115829 122581 err: -74 @ 138582
> [...]
>
> You can see some pages start to suffer from read disturbance after about
> 7,000 reads and fail after 200,000 reads, while other pages start at 23,000
> reads but fail at 120,000 reads. There is no rule for when a page starts to
> suffer from read disturbance or how fast it degrades. So a simple read counter
> with a threshold to detect when to recover a page/block does not seem helpful
> to me.
>
> I'm still trying to interpret the test results. At least there are areas in
> the 4 MiB region which show massive bit flips, while other areas still have no
> flipped bits.
>
> For example the log shown above continues with this pattern:
>
> 574: - - - - - - - - err: 0 @ 0
> 575: - - - - - - - - err: 0 @ 0
> 576: - - - - - - - - err: 0 @ 0
> 577: - - - - - - - - err: 0 @ 0
> 578: - - - - - - - - err: 0 @ 0
> 579: - - - - - - - - err: 0 @ 0
> 580: - - - - - - - - err: 0 @ 0
> 581: - - - - - - - - err: 0 @ 0
> 582: - - - - - - - - err: 0 @ 0
> 583: - - - - - - - - err: 0 @ 0
> 584: - - - - - - - - err: 0 @ 0
> 585: - - - - - - - - err: 0 @ 0
> 586: - - - - - - - - err: 0 @ 0
> 587: - - - - - - - - err: 0 @ 0
> 588: - - - - - - - - err: 0 @ 0
> 589: - - - - - - - - err: 0 @ 0
> 590: - - - - - - - - err: 0 @ 0
> 591: - - - - - - - - err: 0 @ 0
> 592: 194921 - - - - - - - err: 0 @ 0
> 593: - - - - - - - - err: 0 @ 0
> 594: - - - - - - - - err: 0 @ 0
> 595: 99328 186011 - - - - - - err: 0 @ 0
> 596: 178049 188598 - - - - - - err: 0 @ 0
> 597: - - - - - - - - err: 0 @ 0
> 598: 88247 - - - - - - - err: 0 @ 0
> 599: 66701 - - - - - - - err: 0 @ 0
> 600: 68454 - - - - - - - err: 0 @ 0
> 601: 152351 - - - - - - - err: 0 @ 0
> 602: 33574 56123 - - - - - - err: 0 @ 0
> 603: 130160 - - - - - - - err: 0 @ 0
> 604: 87415 - - - - - - - err: 0 @ 0
> 605: 121079 140456 - - - - - - err: 0 @ 0
> 606: 78960 201089 - - - - - - err: 0 @ 0
> 607: 67561 - - - - - - - err: 0 @ 0
> 608: 136825 - - - - - - - err: 0 @ 0
> 609: 46315 - - - - - - - err: 0 @ 0
> 610: 38588 86638 100277 149299 193350 - - - err: 0 @ 0
> 611: 77835 106222 184955 - - - - - err: 0 @ 0
> 612: 82427 196739 - - - - - - err: 0 @ 0
> 613: 45261 69448 - - - - - - err: 0 @ 0
> 614: 49466 177882 - - - - - - err: 0 @ 0
> 615: 68595 130868 - - - - - - err: 0 @ 0
> 616: 40169 134280 151830 - - - - - err: 0 @ 0
> 617: 47167 130047 - - - - - - err: 0 @ 0
> 618: 62839 114948 125289 - - - - - err: 0 @ 0
> 619: 45988 - - - - - - - err: 0 @ 0
> 620: 22611 70944 125715 183733 185630 193842 - - err: 0 @ 0
> 621: 71908 171400 - - - - - - err: 0 @ 0
> 622: 21252 44002 114774 154423 190673 - - - err: 0 @ 0
> 623: 33323 35582 101091 117813 - - - - err: 0 @ 0
> 624: 68726 108034 113045 - - - - - err: 0 @ 0
> 625: 45920 63497 122692 159199 165520 169147 200725 - err: 0 @ 0
> 626: 39039 60375 92903 101632 102331 118883 - - err: 0 @ 0
> 627: 44046 102881 163181 - - - - - err: 0 @ 0
> 628: 53511 89063 158921 194571 - - - - err: 0 @ 0
> 629: 45185 78174 118801 160227 192668 - - - err: 0 @ 0
> 630: 106109 117537 165575 170772 183222 - - - err: 0 @ 0
> 631: 8848 15614 120298 - - - - - err: 0 @ 0
> 632: 58004 - - - - - - - err: 0 @ 0
> 633: 102767 155246 200323 - - - - - err: 0 @ 0
> 634: 44970 45381 78299 103220 108726 174601 - - err: 0 @ 0
> 635: 24964 46413 58086 71776 195353 - - - err: 0 @ 0
> 636: 16024 64719 77322 83557 120118 134934 137786 157911 err: -74 @ 173650
> 637: 54520 76187 89813 97778 125270 150291 178132 185518 err: -74 @ 199306
> 638: - - - - - - - - err: 0 @ 0
> 639: - - - - - - - - err: 0 @ 0
> 640: - - - - - - - - err: 0 @ 0
> 641: - - - - - - - - err: 0 @ 0
> 642: - - - - - - - - err: 0 @ 0
> 643: - - - - - - - - err: 0 @ 0
> 644: - - - - - - - - err: 0 @ 0
> 645: - - - - - - - - err: 0 @ 0
> 646: - - - - - - - - err: 0 @ 0
> 647: - - - - - - - - err: 0 @ 0
> 648: - - - - - - - - err: 0 @ 0
> [...]
>
> More confusing: the same test running on a 256 MiB NAND shows a different
> result with far fewer failures. After about 200,000 loops *all* pages are
> still okay (or correctable). The maximum number of bit flips in one page was
> four.
>
> [...]
> 546: - - - - - - - - err: 0 @ 0
> 547: - - - - - - - - err: 0 @ 0
> 548: - - - - - - - - err: 0 @ 0
> 549: - - - - - - - - err: 0 @ 0
> 550: - - - - - - - - err: 0 @ 0
> 551: - - - - - - - - err: 0 @ 0
> 552: - - - - - - - - err: 0 @ 0
> 553: - - - - - - - - err: 0 @ 0
> 554: 198362 - - - - - - - err: 0 @ 0
> 555: 138881 - - - - - - - err: 0 @ 0
> 556: - - - - - - - - err: 0 @ 0
> 557: - - - - - - - - err: 0 @ 0
> 558: - - - - - - - - err: 0 @ 0
> 559: 77431 - - - - - - - err: 0 @ 0
> 560: 100023 - - - - - - - err: 0 @ 0
> 561: - - - - - - - - err: 0 @ 0
> 562: 83265 - - - - - - - err: 0 @ 0
> 563: 154552 - - - - - - - err: 0 @ 0
> 564: 154541 - - - - - - - err: 0 @ 0
> 565: - - - - - - - - err: 0 @ 0
> 566: - - - - - - - - err: 0 @ 0
> 567: - - - - - - - - err: 0 @ 0
> 568: 105275 - - - - - - - err: 0 @ 0
> 569: 91386 186096 - - - - - - err: 0 @ 0
> 570: - - - - - - - - err: 0 @ 0
> 571: 43163 - - - - - - - err: 0 @ 0
> 572: 79839 190846 - - - - - - err: 0 @ 0
> 573: 184267 - - - - - - - err: 0 @ 0
> 574: - - - - - - - - err: 0 @ 0
> 575: - - - - - - - - err: 0 @ 0
> 576: - - - - - - - - err: 0 @ 0
> 577: - - - - - - - - err: 0 @ 0
> 578: - - - - - - - - err: 0 @ 0
> 579: - - - - - - - - err: 0 @ 0
> [...]
> 1848: - - - - - - - - err: 0 @ 0
> 1849: 115731 168972 178123 196740 - - - - err: 0 @ 0
> 1850: - - - - - - - - err: 0 @ 0
> [...]
Thanks a lot for this report, your numbers are very valuable input.
They prove what Artem and I feared: it is almost impossible to define a sane threshold.
So, having exact read-counters will be almost useless.
All we can do is scrub PEBs unconditionally.
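
To make the threshold problem concrete, here is a minimal Python sketch that
summarizes a few rows copied from the 512 MiB log quoted above. It is not the
test program used for the measurement; the helper names parse_row and
pages_failed_by are invented for this illustration, and the row format is
assumed to be exactly the one shown in the log.

```python
import re

# Rows copied verbatim from the 512 MiB log quoted above ("> " prefix removed).
LOG = """\
562: 34135 69826 90917 107625 147321 161796 194928 199981 err: 0 @ 0
563: 36564 83188 89780 110756 113977 132219 171701 181298 err: -74 @ 196719
567: 7827 81759 105200 146536 175196 181900 192630 200021 err: 0 @ 0
571: 53596 66769 106074 133504 134134 163610 169159 178226 err: -74 @ 180360
573: 23161 43896 89573 95705 102324 102887 115829 122581 err: -74 @ 138582
636: 16024 64719 77322 83557 120118 134934 137786 157911 err: -74 @ 173650
637: 54520 76187 89813 97778 125270 150291 178132 185518 err: -74 @ 199306
"""

# Assumed row layout: "<page>: <8 bitflip columns> err: <code> @ <iteration>"
ROW = re.compile(r"^(\d+):\s+(.*?)\s+err:\s+(-?\d+)\s+@\s+(\d+)$")


def parse_row(line):
    """Parse one log row; return None if the line is not a data row."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    # A "-" column means that bitflip had not occurred yet.
    flips = [int(tok) for tok in m.group(2).split() if tok != "-"]
    return {
        "page": int(m.group(1)),
        "flips": flips,
        "err": int(m.group(3)),
        "err_iter": int(m.group(4)),
    }


def pages_failed_by(rows, threshold):
    """Pages whose uncorrectable error occurred at or before `threshold` reads."""
    return [r["page"] for r in rows if r["err"] != 0 and r["err_iter"] <= threshold]


rows = [r for r in (parse_row(line) for line in LOG.splitlines()) if r]

for r in rows:
    status = f"failed @ {r['err_iter']}" if r["err"] else "still correctable"
    print(f"page {r['page']}: 1st flip @ {r['flips'][0]}, {status}")

first_flips = [r["flips"][0] for r in rows]
fail_iters = [r["err_iter"] for r in rows if r["err"] != 0]
print("first bitflip range:", min(first_flips), "-", max(first_flips))
print("failure range:", min(fail_iters), "-", max(fail_iters))
print("failed by 150000 reads:", pages_failed_by(rows, 150000))
```

In this sample, page 567 shows the earliest first bitflip (iteration 7827) yet
is still correctable at the time of the snapshot, while page 573 starts later
(iteration 23161) and returns -74 at iteration 138582. Neither the read count
nor the time of the first bitflip predicts when a page becomes uncorrectable.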
Can you share your test program? I'd like to run it also on one of my boards.
Thanks,
//richard