UBIFS question

Thu Mar 17 07:55:44 PDT 2016

Hi Martin,

On Thu, 17 Mar 2016 12:54:43 +0000
Martin Townsend <mtownsend1973 at gmail.com> wrote:

> Hi Ricard, Richard
> 
> On Thu, Mar 17, 2016 at 11:43 AM, Ricard Wanderlof
> <ricard.wanderlof at axis.com> wrote:
> >
> >> > We expect the flash devices to start failing quicker than normally
> >> > expected due to the environment in which they will be operating in, so
> >> > sudden NAND blocks turning bad will eventually happen and what we
> >> > would like to do is try and capture this as soon as possible.
> >> > The boards are not accessible as they will be located in very remote
> >> > locations so detecting these failures before the system locks up would
> >> > be an advantage so we can report home with the information and fail
> >> > over to the other filesystem (providing that hasn't also been
> >> > corrupted).
> >>
> >> Dealing with sudden bad NAND blocks is almost impossible.
> >> Unless you have a copy of each block.
> >> NAND is not expected to gain bad blocks without an indication like
> >> correctable bitflips.
> 
> I'm not interested in dealing with sudden bad NAND blocks, I accept
> this will more than likely happen at some point but what I am
> interested in is early detection.  Once the system has booted most
> files will be cached to memory and the product that the flash devices
> are in is designed to run for many months without being power cycled
> so what I'm looking to do is monitor the health of the flash devices.
> Ideally I would like to know FEC counts but I doubt I will get this
> information :) But checking LEBs, pages etc for bad checksums would be
> great.
> 
> >
> > Yes, although the NAND flash documentation sometimes reads like blocks can
> > suddenly 'go bad' for no special reason, in practice it is due to
> > excessive erase/write cycles, i.e. its a wear problem.
> >
> > However, I don't know, if you are operating the flash in an environment
> > where there is cosmic radiation that can actually damage the chip for
> > instance, then of course any part of the chip could fail randomly with a
> > fairly high probability. But NAND bad block management is not designed to
> > take care of that case, which is why bad block detection is only done
> > during block erasure (i.e. when a block fails to erase).
> >
> I'm not sure how much I can say I'm afraid as I'm under NDA but assume
> that it is going to be operating in an environment where it's
> receiving more cosmic radiation than expected. So I could look at the
> bad block detection code to get some ideas?  I don't necessary want to
> mark blocks as bad I just want to detect them so I have an idea that
> the flash is failing.

I guess you're more worried about bitflips than blocks becoming bad
(which, AFAIK, can only happen when writing or erasing a block, not
when reading it).
If bitflips detection/prevention is what your looking for, I guess
ubihealthd (developed by Richard) could help.

[1]https://lwn.net/Articles/663751/
[2]https://lkml.org/lkml/2015/3/29/31

-- 
Boris Brezillon, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com