Mean Time Between Failure - UBI clarifications

Thu Feb 24 09:04:04 EST 2011

On Thu, 24 Feb 2011, Navaneethan P wrote:

> Hi Linux-mtd users,
>
>
> In our product, we are using 128MB of NAND Flash (Samsung / Micron).
> The whole NAND flash is configured as a single MTD partition. We are
> using UBI over the MTD partition.
>
> With this input, we wanted to calculate the Mean Time between failures
> (MTBF) of our product. In this context,
>
> 1) We wanted to term "bitflip" as a failure. Is our understanding
> correct or should we only consider a bad block as a failure?

I'd say it's a failure in the sense that the raw data from the flash is
not what you expect, but UBI handles this transparently so it's not a
failure from the user's point of view. Furthermore, bitflips are inherent
to the design of NAND flash, and a bitflip does not indicate that there is
actually anything abnormal about that particular bit.

A bad block is more of a failure in that it can contain bits which are 
unreliable, or stuck at a particular bit level. At least this is the case 
for blocks that have been detected bad at the factory and marked as such, 
but they are not really part of the equation since they should not be used 
anyway.

The ordinary way for a block to 'fail' is when the number of erase/write
cycles performed on the block causes it to physically wear out. A worn-out
block has lower data retention (i.e. greater susceptibility to bitflips)
than other blocks. Usually, if an erase or write operation fails (i.e.
the on-chip erase/write algorithm on the flash times out before the
operation is completed, and reports a failure status to the host), the
block is considered 'bad'. However, note that it is not necessarily an
either-or situation. The block might not suddenly go dead. Instead, its
data retention characteristics and erase/write cycle times can get worse
and worse as the block is erased and rewritten. At some point, the on-chip
algorithm on the flash signals that the erase or write took too long, but
the characteristics of the block might be far below spec before then.

It's up to you as a user to decide when the block is 'bad' in this case.

> 2) Is there any standard way to find out the number of bitflips from
> the UBI? If no, is it suggested to modify the UBI subsystem of the
> Linux kernel to get the bit flip counter?

MTD supplies statistics counters that might help. For each MTD partition
there is one counter which is incremented every time a read operation
requires ECC to correct a bit (i.e. a correctable single bit error), and
one counter for ECC failures (uncorrectable errors, e.g. two-bit errors
with 1-bit ECC).
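
As a rough illustration of how those counters could be sampled from user
space, here is a minimal Python sketch. It assumes a kernel that exports
the per-device attributes corrected_bits, ecc_failures, bad_blocks and
bbt_blocks under /sys/class/mtd/mtdX (older kernels may lack them, in
which case the ECCGETSTATS ioctl is the alternative). The function names
and the default device name "mtd0" are made up for this example.

```python
from pathlib import Path

# Attribute names as listed in the kernel's sysfs-class-mtd ABI document.
# Assumption: the running kernel is recent enough to export them.
ECC_ATTRS = ("corrected_bits", "ecc_failures", "bad_blocks", "bbt_blocks")


def read_ecc_counters(mtd_name="mtd0", sysfs_root="/sys/class/mtd"):
    """Return the ECC counters of one MTD device as a dict of ints.

    Attributes that are missing or unreadable are reported as None.
    """
    dev_dir = Path(sysfs_root) / mtd_name
    counters = {}
    for attr in ECC_ATTRS:
        try:
            counters[attr] = int((dev_dir / attr).read_text().strip())
        except (OSError, ValueError):
            counters[attr] = None
    return counters


def counter_delta(before, after):
    """Difference between two snapshots, skipping unavailable counters."""
    return {
        key: after[key] - before[key]
        for key in before
        if before[key] is not None and after.get(key) is not None
    }
```

Taking a snapshot with read_ecc_counters() at regular intervals and
feeding consecutive snapshots to counter_delta() gives the number of
corrected and uncorrectable events per interval, which is probably more
useful for a reliability estimate than the absolute counts.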

I don't know about UBI, someone else probably does.

> 3) Is there any standard software / approach which can be used to find 
> out the reliability / MTBF / MTTF (Mean Time To Failure) of our NAND 
> Flash?

The manufacturers provide some data, however my experience has been that 
it is very difficult to get any form of reliability information.

One way would be to take the spec of the number of erase/write cycles that
the flash can handle (probably 100 000 for your flash), and calculate how
much data will be written to the flash over a certain amount of time.
Since UBI does wear-levelling, the erases should be spread roughly evenly
over all blocks. When you reach 100 000 erase/write cycles on any given
block, that can be counted as a failure.
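
The arithmetic behind that estimate can be sketched as follows. This is
only a back-of-the-envelope model: it assumes ideal wear-levelling (every
block is erased equally often), and the 128 KiB erase block size, the
1 GiB/day workload and the write amplification factor of 2 are made-up
example values, not figures for any particular chip or product.

```python
def years_until_wearout(flash_bytes, block_bytes, rated_cycles,
                        bytes_written_per_day, write_amplification=1.0):
    """Estimate years until the blocks reach their rated erase/write cycles.

    Assumes ideal wear-levelling, i.e. erases spread evenly over all blocks.
    """
    if bytes_written_per_day <= 0:
        raise ValueError("bytes_written_per_day must be positive")
    blocks = flash_bytes // block_bytes
    # Each block erase makes room for one block's worth of written data.
    erases_per_day = bytes_written_per_day * write_amplification / block_bytes
    total_erase_budget = blocks * rated_cycles
    return total_erase_budget / erases_per_day / 365.0


MiB = 1024 * 1024
example_years = years_until_wearout(
    flash_bytes=128 * MiB,             # flash size from the question
    block_bytes=128 * 1024,            # assumed erase block size
    rated_cycles=100_000,              # cycle rating guessed above
    bytes_written_per_day=1024 * MiB,  # assumed workload: 1 GiB/day
    write_amplification=2.0,           # assumed overhead factor
)
```

With these example numbers the result is about 17 years. Real
wear-levelling is not perfectly even and blocks can degrade before the
rated cycle count, so this is an optimistic upper bound rather than an
MTBF figure.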

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30