UBIFS and hardware ECC of all FF pages of MLC NAND

Sun Oct 11 10:38:00 EDT 2009

Hi Artem, 

Some feedback inline.  Thanks.

> The other reason is more subtle, and specific to NAND flashes which have
> ECC calculation algorithm which produces ECC code not equivalent to all
> 0xFF bytes if the NAND page contains only 0xFF bytes. Consider an
> example.
> 
>       * We erase whole flash, so everything is 0xFF'ed now.
>       * We write an UBI/UBIFS image to flash using nandwrite.
>       * Some eraseblocks in the UBIFS image may contain several empty
>         NAND pages at the end, and UBIFS will write to them when it is
>         run.

I think it is dangerous for UBIFS to assume that FF data = FF oob, especially
as hardware ECC engines are becoming more and more common. It would be nice if there
were a standard requiring that all FF data generate an all FF ECC, but this isn't the case
(though it would solve some corruption issues). Perhaps we should leave a runtime
check in (not a paranoid check) for the next year or two that also checks the oob
when the data is all FF, just to catch these issues.
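
As a minimal sketch of such a check (in Python for brevity, not kernel code; the
function names are hypothetical and are not part of UBIFS or MTD), a page would
only be treated as free when both the data area and the oob area are all FF:

```python
def is_all_ff(buf: bytes) -> bool:
    """Return True if every byte in buf is 0xFF (an empty buffer counts as erased)."""
    return all(b == 0xFF for b in buf)


def page_is_safe_to_program(data: bytes, oob: bytes) -> bool:
    """Hypothetical check: all-FF data is only trusted as 'free' if the oob is all FF too."""
    if not is_all_ff(data):
        return False  # the page already holds data
    return is_all_ff(oob)  # all-FF data with a non-FF oob means the page was programmed


# A truly erased 512-byte sector with a 16-byte oob (sizes are illustrative only).
print(page_is_safe_to_program(b"\xff" * 512, b"\xff" * 16))

# All-FF data that was programmed with a hardware ECC that is not all FF.
stale_oob = bytes.fromhex("10aed1f6126c653d68861adb4a") + b"\xff" * 3
print(page_is_safe_to_program(b"\xff" * 512, stale_oob))
```

The first call prints True and the second prints False, which is the case the
extra runtime check is meant to catch.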

>       * When later UBIFS runs, it writes data to these NAND pages, which
>         means that a new ECC code is calculated, and written on top of
>         the existing one (unsuccessfully, of course). This may trigger
>         an error straight away, but usually at this point no error is
>         triggered.

When this happens, you often see what amounts to a bitwise AND taking place on the ECC.
For example, if the ECC for a 512 byte sector of all FF data is
"10 ae d1 f6 12 6c 65 3d 68 86 1a db 4a"
and the intended ECC for a new sector of non FF data is
"18 20 f1 91 87 d3 bd 30 a7 4f 3f 23 75"
then the resultant ECC I have seen (since programming can only change 1's to 0's) is the AND of the two:
"10 20 d1 90 02 40 25 30 20 06 1a 03 40"

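The numbers above can be reproduced with a few lines of Python. This is only a
model of the behaviour, assuming that programming without an erase can clear
bits but never set them; program_over is a hypothetical helper name:

```python
def program_over(existing: bytes, new: bytes) -> bytes:
    """Model programming without an erase: bits only go 1 -> 0, so the result is old AND new."""
    if len(existing) != len(new):
        raise ValueError("buffers must have the same length")
    return bytes(a & b for a, b in zip(existing, new))


# ECC values taken from the example in this mail.
ecc_all_ff = bytes.fromhex("10 ae d1 f6 12 6c 65 3d 68 86 1a db 4a")
ecc_new = bytes.fromhex("18 20 f1 91 87 d3 bd 30 a7 4f 3f 23 75")

stored = program_over(ecc_all_ff, ecc_new)
print(stored.hex(" "))  # 10 20 d1 90 02 40 25 30 20 06 1a 03 40
```

The stored result matches neither ECC, so it is valid for neither the old all FF
data nor the newly written data.
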
Readback validation, if it were turned on, would catch that the ECC correction could not be
performed, and you would see an error right away in this case. An interesting thing
is that I have proven with my 4K page MLC flashes that _other_ blocks can have their ECCs
corrupted when this collision occurs, though this might be a local hardware issue. That
took a while to debug, in case anyone is having similar problems.
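
The idea of readback validation can be sketched as follows. This is a simplified
simulation, not driver code: it compares raw bytes instead of running the ECC
correction, and program and write_and_verify are hypothetical names:

```python
def program(cells: bytearray, new: bytes) -> None:
    """Simulated program operation: each stored byte can only lose 1 bits."""
    if len(cells) != len(new):
        raise ValueError("buffers must have the same length")
    for i, value in enumerate(new):
        cells[i] &= value


def write_and_verify(cells: bytearray, new: bytes) -> bool:
    """Program the cells, read them back, and report whether they match what was intended."""
    program(cells, new)
    return bytes(cells) == new


intended = bytes.fromhex("1820f191")

erased = bytearray(b"\xff" * 4)            # freshly erased ECC bytes
print(write_and_verify(erased, intended))  # True: a first write sticks

used = bytearray(bytes.fromhex("10aed1f6"))  # ECC bytes already programmed once
print(write_and_verify(used, intended))      # False: the second write collides
```

With a check like this enabled, the collision shows up at write time rather
than at some later read.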

>       * At some point UBIFS is trying to read from these pages, and gets
>         an ECC error (-EBADMSG = -74).
> 
> In fewer words, ubiformat makes sure that every NAND page is written
> once and only once after the erasure. If you use nandwrite, some pages
> are written twice - once by nandwrite, and once by UBIFS.

This may be all the more reason to leave a runtime check in for a while that the oob is all FF
whenever the data is all FF. It is good defensive programming not to assume anything about
what earlier flash operations did.

> 
> -- 
> Best Regards,
> Artem Bityutskiy (Артём Битюцкий)
> 
> 
> 
Thanks!

Darwin