NAND Bad Block Marking Policy

Ricard Wanderlof ricard.wanderlof at axis.com
Fri Feb 19 01:55:30 PST 2016


On Thu, 18 Feb 2016, Guilherme de Oliveira Costa wrote:

> Indeed, the U-Boot solution is derived from the MTD implementation.
> 
> While I understand that factory bad blocks are marked in such a way, isn't
> it possible that a random page in a block goes bad, and we should mark its
> eraseblock as bad too? Or is such a thing impossible?

If we decide that a page within a block is 'bad', we should mark the 
whole block as bad.

The question, however, is what 'bad' means. From a factory point of 
view, bad blocks are blocks which don't meet the specs of the component: 
for instance, a physical defect on the chip renders a block useless, or 
the data retention or other parameters don't meet the specs.

Since a 'bad' block could contain any type of defect, it is unwise to use 
it even if only a single cell or page appears to be malfunctioning. 
That's why 'bad' blocks are tracked at block granularity. In normal use, 
the only practical way to get a 'bad' block is through excessive writing 
and erasure, so in most cases you'd never have to mark a block as bad. 
Hence tracking 'bad blocks' at block granularity does not incur much of a 
capacity penalty, and it's part of the specs of NAND flash chips that a 
certain percentage of blocks may be bad, or go bad, during the lifetime 
of the chip.
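
In code terms the policy is trivial: any failing page address just gets 
rounded down to its eraseblock boundary, and the whole block is marked. 
A minimal sketch, assuming the erase size is a power of two (which it is 
on NAND):

#include <stdint.h>
#include <sys/types.h>

/* Round a failing page's offset down to the start of its
 * eraseblock; the whole block gets marked bad, never an
 * individual page. Assumes erasesize is a power of two. */
static inline loff_t block_start(loff_t page_ofs, uint32_t erasesize)
{
	return page_ofs & ~(loff_t)(erasesize - 1);
}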

In Linux, the only time we mark a block as bad is when JFFS2 notices that 
it fails to erase properly. (Correct me if I'm wrong on this one; that's 
the way it used to be, and a quick grep for markbad in jffs2 and ubifs 
turned up nothing else.) So from the Linux point of view, 'bad' means 
that the whole block is bad, because it can't be successfully erased, and 
erasure always takes place at the block level.
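
To make that concrete, here's a minimal userspace sketch of the same 
'erase failed, so mark the whole block bad' policy, using the standard 
MTD char device ioctls from <mtd/mtd-user.h>. This is not the actual 
JFFS2 code path, just an illustration; note that MEMSETBADBLOCK requires 
the device to be opened O_RDWR.

#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <mtd/mtd-user.h>

/* Try to erase the eraseblock at 'ofs'; if the erase fails, mark
 * the whole block bad so it is skipped from then on.
 * Returns 0 if erased, 1 if marked bad, -1 on error. */
static int erase_or_markbad(int fd, loff_t ofs, uint32_t erasesize)
{
	struct erase_info_user ei = {
		.start  = (uint32_t)ofs,
		.length = erasesize,
	};

	if (ioctl(fd, MEMERASE, &ei) == 0)
		return 0;
	perror("MEMERASE");
	if (ioctl(fd, MEMSETBADBLOCK, &ofs) < 0) {
		perror("MEMSETBADBLOCK");
		return -1;
	}
	return 1;
}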

> This question is motivated by the fact that our NANDs are not indicating any
> Bad Blocks, which we find very weird, since every piece of literature we've
> come across says that there WILL be bad blocks in a NAND. We are worried
> that our bad block checking is somehow broken, and we keep overwriting
> this information.

Given a batch of NAND flashes, it is quite possible for a certain 
percentage of the devices to have no bad blocks at all.
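
If you suspect your own bad block check, one way to rule it out is to 
ask the kernel directly from userspace via the MEMGETBADBLOCK ioctl and 
compare. A small sketch (/dev/mtd0 is just an example path; pick your 
partition):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <mtd/mtd-user.h>

int main(void)
{
	struct mtd_info_user info;
	loff_t ofs;
	int nbad = 0;
	int fd = open("/dev/mtd0", O_RDONLY);

	if (fd < 0 || ioctl(fd, MEMGETINFO, &info) < 0) {
		perror("mtd");
		return 1;
	}
	/* Step through the partition one eraseblock at a time and
	 * ask the kernel whether each block is marked bad. */
	for (ofs = 0; ofs < info.size; ofs += info.erasesize) {
		int ret = ioctl(fd, MEMGETBADBLOCK, &ofs);

		if (ret > 0) {
			printf("bad eraseblock at 0x%llx\n",
			       (unsigned long long)ofs);
			nbad++;
		} else if (ret < 0) {
			perror("MEMGETBADBLOCK");
		}
	}
	printf("%d bad eraseblock(s)\n", nbad);
	close(fd);
	return 0;
}

IIRC nanddump from mtd-utils reports bad blocks in much the same way, so 
you can cross-check against that as well.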

In my (limited) experience, there seem to be two types of factory-marked 
bad blocks. Some blocks are simply marked as bad: they can be erased just 
like any other block, and often used with no immediate problems. Most 
likely these blocks failed some manufacturer specification during testing 
at the factory, such as data retention time. In a lab environment I have 
used such bad blocks without problems, but I would never consider letting 
a product with such resuscitated bad blocks reach an end user.

Then there are bad blocks which contain all zeros and which cannot be 
erased. I would think these are somehow marked at an earlier stage of 
manufacture, when it has been deemed that there are physical problems in 
the actual chip, so the block in question is disconnected in order to 
avoid problems.

(I haven't verified these two bad block types with any manufacturer; this 
is just from empirical studies of random individual SLC flashes in the 
1-2 Gbit range.)
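
If you want to poke at this yourself: as far as I know the NAND read 
path doesn't check the bad block table, so a bad block can be read back 
as usual (expect ECC errors; the data is returned anyway), and a block 
that reads as solid 0x00 hints at the second, disconnected kind. A crude 
sketch of that heuristic (compile with -D_FILE_OFFSET_BITS=64 on 32-bit 
systems so pread takes 64-bit offsets):

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the first page of a factory-marked bad block at 'ofs' and
 * report whether it is solidly 0x00. Heuristic only, based on the
 * empirical observations above, not on any datasheet. */
static const char *classify_bad_block(int fd, loff_t ofs,
				      uint32_t writesize)
{
	uint8_t buf[4096];	/* adjust to your page size */
	ssize_t n, i;

	if (writesize > sizeof(buf))
		writesize = sizeof(buf);
	n = pread(fd, buf, writesize, ofs);
	if (n <= 0)
		return "read failed";
	for (i = 0; i < n; i++)
		if (buf[i] != 0x00)
			return "has non-zero data (the erasable kind?)";
	return "all zeros (possibly disconnected at manufacture)";
}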

/Ricard
-- 
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30


