UBIFS does not mount after powerfail

Tue Nov 28 13:00:39 PST 2017

Manfred,

On Thursday, 23 November 2017, 23:03:28 CET, Manfred Spraul wrote:
> Hi Richard,
> 
> I have now three datasets:
> - no xattr, no FASTMAP:
> The log consists of ~189.000 WRITE or ERASE commands.
> -- with chk_fs: 30.000 images tested, all ok.
> 
> -- with chk_fs, when splitting large writes at PAGE_SIZE: 814 images
> tested, all ok.
> 
> --> no issues at all when not using xattr.
> 
> - ecryptfs with ecryptfs_xattr_metadata:
> The log consists of ~188.000 WRITE or ERASE commands.
> 
> -- without chk_fs: 23.000 images tested, 5 not mountable images, all 5
> within garbage_collect_leb():
> 
> If I see it right, the root cause is always a node that crosses a page
> boundary:
> the first half of the node is written, the 2nd half is not written, it
> is still 0xff.
> These nodes cause CRC failures during scanning.
> (perhaps: output of layout_in_empty_space(), writing to an erased LEB
> instead of changing a LEB not properly handled?)
> 
> -- with chk_fs: 795 images tested, 62 not mountable.
> Obviously including the 5 above: chk_fs runs after recovery_completed,
> garbage_collect_leb() is run during recovery.
> 
> -- kill-orphaned-xattr, with chk_fs: 215 images tested, 156 not mountable.
> Note: This is not worse than without the patch. There are long streams
> of images that fail during chk_fs, 200 images is not enough for good
> statistics.
> And: I have not tested the same images as without the patch.
> 
> - ecryptfs with ecryptfs_xattr_metadata and with FASTMAP
> The log consists of ~197.000 WRITE or ERASE commands.
> 
> 21.000 images tested, 178 do not mount. all fail in chk_fs.
> 
> The failure is always something like this:
> > [34802.217857] UBIFS error (ubi0:0 pid 25706): ubifs_read_node: bad node at LEB 243:74672, LEB mapping status 0
> > [34802.218965] Not a node, first 24 bytes:
> > [34802.218969] 00000000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> 
> I have not tested with chk_fastmap.
> And: Unlike above, where I tested the last images, I have here tested
> the first 20k images, thus a more or less empty media.
> The lower failure rate could be caused by that.
> 
> Did you have the time to look at the images?
> If you need more images, or if I should test a patch, just ask.

I tried, but TBH I'm completely lost in all the data you're throwing at me.

Let's recap: you trigger a corruption that happens only(!) when xattrs are
used?
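
If I read your garbage_collect_leb() case right, the pattern is a node that
crosses a page boundary where the first part made it to flash and the rest is
still 0xff. A minimal Python sketch of that check, just to pin down what I
mean. The 2048-byte page size and the helper names are my assumptions, and
offset/length are taken as already known (in a real scan they would have to
come from the node header):

```python
PAGE_SIZE = 2048  # assumed NAND page size; adjust to the real flash geometry


def crosses_page_boundary(offset, length, page_size=PAGE_SIZE):
    """True if the byte range [offset, offset + length) spans more than one page."""
    return offset // page_size != (offset + length - 1) // page_size


def looks_torn(leb, offset, length, page_size=PAGE_SIZE):
    """Heuristic for a half-written node inside one LEB image.

    Returns True when the node crosses a page boundary, the part in its
    first page contains programmed data, and everything after that first
    boundary is still erased (0xff).
    """
    if length <= 0 or not crosses_page_boundary(offset, length, page_size):
        return False
    # First page boundary after the start of the node.
    boundary = (offset // page_size + 1) * page_size
    head = leb[offset:boundary]
    tail = leb[boundary:offset + length]
    return any(b != 0xFF for b in head) and all(b == 0xFF for b in tail)
```

Called as looks_torn(leb_bytes, node_offset, node_len) on a LEB read from one
of the failing images, this would only flag the "head written, tail erased"
case, nothing else.
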
How is Fastmap involved in the game? If it is involved, I want to know whether
you can trigger the corruption without Fastmap being enabled.

Which image was the first to fail with chk_fs enabled?
On a vanilla kernel...

How did you save that image? I'd like to use it in my simulator too.
Make sure to not store OOB data.
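
In case your dumps already contain OOB, here is a rough Python sketch for
dropping it again. It assumes the dump is laid out as page data followed by
that page's OOB, with 2048-byte pages and 64 bytes of OOB; both the layout and
the sizes are assumptions on my side, and strip_oob is just a made-up helper
name, so check them against your flash before trusting the result:

```python
def strip_oob(raw, page_size=2048, oob_size=64):
    """Return only the data area of every page from a page+OOB interleaved dump."""
    stride = page_size + oob_size
    if len(raw) % stride != 0:
        # A length mismatch usually means the assumed geometry is wrong.
        raise ValueError("dump length is not a multiple of page_size + oob_size")
    out = bytearray()
    for pos in range(0, len(raw), stride):
        # Keep the page data, skip the OOB bytes that follow it.
        out += raw[pos:pos + page_size]
    return bytes(out)
```

The length check is there on purpose: if the geometry is wrong, I'd rather
have it fail loudly than hand the simulator a shifted image.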

Thanks,
//richard