UBIFS question

Thu Mar 17 08:39:08 PDT 2016

Hi Boris,

On Thu, Mar 17, 2016 at 2:55 PM, Boris Brezillon
<boris.brezillon at free-electrons.com> wrote:
> Hi Martin,
>
> On Thu, 17 Mar 2016 12:54:43 +0000
> Martin Townsend <mtownsend1973 at gmail.com> wrote:
>
>> Hi Ricard, Richard
>>
>> On Thu, Mar 17, 2016 at 11:43 AM, Ricard Wanderlof
>> <ricard.wanderlof at axis.com> wrote:
>> >
>> >> > We expect the flash devices to start failing sooner than normally
>> >> > expected due to the environment in which they will be operating, so
>> >> > NAND blocks suddenly turning bad will eventually happen and what we
>> >> > would like to do is try and capture this as soon as possible.
>> >> > The boards are not accessible as they will be located in very remote
>> >> > locations so detecting these failures before the system locks up would
>> >> > be an advantage so we can report home with the information and fail
>> >> > over to the other filesystem (providing that hasn't also been
>> >> > corrupted).
>> >>
>> >> Dealing with sudden bad NAND blocks is almost impossible.
>> >> Unless you have a copy of each block.
>> >> NAND is not expected to gain bad blocks without an indication like
>> >> correctable bitflips.
>>
>> I'm not interested in dealing with sudden bad NAND blocks, I accept
>> this will more than likely happen at some point but what I am
>> interested in is early detection.  Once the system has booted most
>> files will be cached to memory and the product that the flash devices
>> are in is designed to run for many months without being power cycled
>> so what I'm looking to do is monitor the health of the flash devices.
>> Ideally I would like to know FEC counts but I doubt I will get this
>> information :) But checking LEBs, pages etc for bad checksums would be
>> great.
>>
>> >
>> > Yes, although the NAND flash documentation sometimes reads like blocks can
>> > suddenly 'go bad' for no special reason, in practice it is due to
>> > excessive erase/write cycles, i.e. it's a wear problem.
>> >
>> > However, I don't know, if you are operating the flash in an environment
>> > where there is cosmic radiation that can actually damage the chip for
>> > instance, then of course any part of the chip could fail randomly with a
>> > fairly high probability. But NAND bad block management is not designed to
>> > take care of that case, which is why bad block detection is only done
>> > during block erasure (i.e. when a block fails to erase).
>> >
>> I'm not sure how much I can say I'm afraid as I'm under NDA but assume
>> that it is going to be operating in an environment where it's
>> receiving more cosmic radiation than expected. So I could look at the
>> bad block detection code to get some ideas?  I don't necessarily want to
>> mark blocks as bad, I just want to detect them so I have an idea that
>> the flash is failing.
>
> I guess you're more worried about bitflips than blocks becoming bad
> (which, AFAIK, can only happen when writing or erasing a block, not
> when reading it).
> If bitflip detection/prevention is what you're looking for, I guess
> ubihealthd (developed by Richard) could help.
>
> [1] https://lwn.net/Articles/663751/
> [2] https://lkml.org/lkml/2015/3/29/31
>
>

Looks very promising, thank you for the links.  Bitflip detection is
definitely something I am looking for.  If I could get some metrics on
bitflips detected even better :) I will take a closer look.
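
For the metrics side, one possible starting point, independent of
ubihealthd, is to poll the per-device ECC counters that the MTD layer
exports through sysfs. The sketch below assumes the target kernel provides
the corrected_bits, ecc_failures and bad_blocks attributes under
/sys/class/mtd/mtdN/ (older kernels may lack some of them). The helper
names read_mtd_counters and counter_deltas are made up for illustration.
As far as I understand, the counters only change when pages are actually
read, so something still has to read the flash periodically for the
numbers to be meaningful.

```python
import glob
import os

# Per-device counters from the MTD sysfs ABI. Assumption: the target kernel
# exposes all three; a missing attribute is reported as None.
COUNTERS = ("corrected_bits", "ecc_failures", "bad_blocks")


def read_mtd_counters(mtd_dir):
    """Return {counter: int or None} for one /sys/class/mtd/mtdN directory."""
    stats = {}
    for name in COUNTERS:
        try:
            with open(os.path.join(mtd_dir, name)) as f:
                stats[name] = int(f.read().strip())
        except (OSError, ValueError):
            stats[name] = None  # attribute missing or unreadable
    return stats


def counter_deltas(previous, current):
    """Growth of each counter between two samples (None if either is missing)."""
    deltas = {}
    for name in COUNTERS:
        before, after = previous.get(name), current.get(name)
        if before is None or after is None:
            deltas[name] = None
        else:
            deltas[name] = after - before
    return deltas


if __name__ == "__main__":
    for mtd_dir in sorted(glob.glob("/sys/class/mtd/mtd[0-9]*")):
        if mtd_dir.endswith("ro"):
            continue  # skip the read-only mtdNro aliases
        print(os.path.basename(mtd_dir), read_mtd_counters(mtd_dir))
```

Sampling the counters at intervals and reporting the deltas home would
give a rough trend of corrected bitflips per partition. Which growth rate
should trigger a fail-over is something I would have to work out
separately.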

Many Thanks,
Martin.

> --
> Boris Brezillon, Free Electrons
> Embedded Linux and Kernel engineering
> http://free-electrons.com