Mean Time Between Failure - UBI clarifications
Ricard Wanderlof
ricard.wanderlof at axis.com
Thu Feb 24 09:04:04 EST 2011
On Thu, 24 Feb 2011, Navaneethan P wrote:
> Hi Linux-mtd users,
>
>
> In our product, we are using 128MB of NAND Flash (Samsung / Micron).
> The whole NAND flash is configured as a single MTD partition. We are
> using UBI over the MTD partition.
>
> With this input, we wanted to calculate the Mean Time between failures
> (MTBF) of our product. In this context,
>
> 1) We wanted to term "bitflip" as a failure. Is our understanding
> correct or should we only consider a bad block as a failure?
I'd say it's a failure in the sense that the raw data from the flash is
not what you expect, but UBI handles this transparently so it's not a
failure from the user's point of view. Furthermore, bitflips are inherent
to the design of NAND flash, and a bitflip does not indicate that there is
actually anything abnormal about that particular bit.
A bad block is more of a failure in that it can contain bits which are
unreliable, or stuck at a particular bit level. At least this is the case
for blocks that have been detected bad at the factory and marked as such,
but they are not really part of the equation since they should not be used
anyway.
The ordinary way for a block to 'fail' is when the number of erase/write
cycles performed on the block causes it to physically wear out. A worn-out
block has lower data retention (i.e. larger susceptibility to bitflips)
than other blocks. Usually if an erase or write operation times out (i.e.
the on-chip erase/write algorithm on the flash times out before the
operation is completed, and indicates a failure status to the host) the
block is considered 'bad'. However, note that it is not necessarily an
either-or situation. The block might not suddenly go dead. Instead, its
data retention characteristics and erase/write cycle times can get worse
and worse as the block is erased and rewritten. At some point, the on-chip
algorithm on the flash signals that an erase or write took too long, but
the characteristics of the block might be far below spec before then.
It's up to you as a user to decide when the block is 'bad' in this case.
> 2) Is there any standard way to find out the number of bitflips from
> the UBI? If no, is it suggested to modify the UBI subsystem of the
> Linux kernel to get the bit flip counter?
MTD supplies statistics counters that might help. For each MTD partition
there is one counter which is incremented every time a read operation
requires ECC to correct a bit (i.e. a correctable single bit error), and
one counter for ECC failures (two-bit errors).
I don't know about UBI, someone else probably does.
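As a rough illustration of reading the MTD counters mentioned above from
user space, here is a minimal sketch using the ECCGETSTATS ioctl. It
assumes the definitions in <mtd/mtd-abi.h> (request 'M', 18, returning a
struct mtd_ecc_stats of four 32-bit counters) and the ioctl direction
encoding used on x86 and ARM; the device path /dev/mtd0 is only an
example, so check all of these against your own kernel headers and board.

```python
import fcntl
import os
import struct
from collections import namedtuple

# Assumption: struct mtd_ecc_stats is four __u32 fields in this order
# (corrected, failed, badblocks, bbtblocks), as in <mtd/mtd-abi.h>.
ECC_STATS_FORMAT = "=4I"
ECC_STATS_SIZE = struct.calcsize(ECC_STATS_FORMAT)

# Assumption: ECCGETSTATS is _IOR('M', 18, struct mtd_ecc_stats).
# _IOC_READ is 2 on x86 and ARM; some architectures (e.g. MIPS, PowerPC)
# encode the direction bits differently.
_IOC_READ = 2
ECCGETSTATS = (_IOC_READ << 30) | (ECC_STATS_SIZE << 16) | (ord("M") << 8) | 18

EccStats = namedtuple("EccStats", "corrected failed badblocks bbtblocks")


def parse_ecc_stats(raw):
    """Decode the raw bytes of a struct mtd_ecc_stats."""
    return EccStats(*struct.unpack(ECC_STATS_FORMAT, raw))


def read_ecc_stats(device="/dev/mtd0"):
    """Ask the MTD character device for its ECC counters."""
    fd = os.open(device, os.O_RDONLY)
    try:
        # With a bytes argument, fcntl.ioctl returns the filled-in buffer.
        raw = fcntl.ioctl(fd, ECCGETSTATS, bytes(ECC_STATS_SIZE))
    finally:
        os.close(fd)
    return parse_ecc_stats(raw)


if __name__ == "__main__":
    stats = read_ecc_stats()
    print("corrected:", stats.corrected, "uncorrectable:", stats.failed)
```

Sampling these counters periodically and logging the difference would
give a bitflip rate per partition, without having to modify UBI.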
> 3) Is there any standard software / approach which can be used to find
> out the reliability / MTBF / MTTF (Mean Time To Failure) of our NAND
> Flash?
The manufacturers provide some data, however my experience has been that
it is very difficult to get any form of reliability information.
One way would be to take the specified number of erase/write cycles that
the flash can handle (probably 100 000 for your flash), and calculate how
much data will be written to the flash over a certain amount of time. When
any given block reaches 100 000 erase/write cycles, that can be counted as
a failure.
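The arithmetic for such an estimate could be sketched as follows. This is
only a back-of-the-envelope calculation under stated assumptions: wear
leveling is taken to be ideal (every block is erased equally often), the
100 000 cycle endurance is the guess from above rather than a datasheet
value, and the daily write volume and write amplification factor are
made-up example inputs.

```python
def years_until_wearout(flash_bytes, endurance_cycles,
                        bytes_written_per_day, write_amplification=1.0):
    """Estimate the years until every block reaches its rated cycle count.

    Assumes ideal wear leveling, so the total amount of data the flash
    can absorb is simply its capacity times the rated erase/write cycles.
    """
    total_writable_bytes = flash_bytes * endurance_cycles
    # Write amplification accounts for the flash seeing more writes than
    # the application issues (hypothetical factor, measure it if you can).
    flash_bytes_per_day = bytes_written_per_day * write_amplification
    days = total_writable_bytes / flash_bytes_per_day
    return days / 365.0


if __name__ == "__main__":
    MIB = 1024 * 1024
    # Example inputs: 128 MiB flash, 100 000 cycles, 1 GiB written per day.
    years = years_until_wearout(128 * MIB, 100_000, 1024 * MIB)
    print("estimated years until wear-out: %.1f" % years)
```

Without wear leveling the picture is much worse, since the most
frequently rewritten block alone determines when the first failure
occurs.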
/Ricard
--
Ricard Wolf Wanderlöf ricardw(at)axis.com
Axis Communications AB, Lund, Sweden www.axis.com
Phone +46 46 272 2016 Fax +46 46 13 61 30