UBIFS volume corruption (bad node at LEB 0:0)
David Bergeron
mho.linux-mtd at b2n.ca
Mon Jan 19 22:44:48 EST 2009
On 2009-01-19, at 3:56, Artem Bityutskiy wrote:
> Just tried to reproduce this on my x86_64 host without success.
Well, today I had a bit of a breakthrough.
Put the nandsim + rsync loop on the side for now, and let's go back to
a slightly evolved version of my original scenario.
boot kernel ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs ro init=/root/updater.sh
/root/updater.sh:
#!/bin/sh -x
mount -t proc none /proc
ifconfig eth0 ...
mount -o remount,rw /
rsync -aHx --delete systemA /
rsync -aHx --delete systemB /
sync
exec /bin/sh -c "mount -o remount,ro /; reboot -df;"
-EOF-
No more 'sync' mount flag, but still an orphan LEB 0:0 on the first
run, every time: 100% reproducible. (reboot -d just means don't update
/var/log/wtmp.)
So to break it, it's necessary to re-mount rw,SYNC in the nandsim
context, but not in the rootfs update scenario (a stripped-down
version of a procedure I've used on jffs2 for two years now).
I've always been puzzled by the lower reproducibility rates in the
nandsim context, because I could replace nandsim with the real flash
and run the same steps (exactly the same hardware, kernel, userland,
rsync, sync data, nand flash, volume layout and whatnot). The only
difference? One scenario is a live rootfs, and the other is not...
Which finally led me to the real cause:
It turns out that my test script /root/updater.sh differed between
systemA and systemB, so it was being replaced by rsync *while it was
still open & running*.
Indeed, once the script was identical on both rsync targets, UBIFS no
longer broke. (I believe at the time of my first report, my rsync
binary itself was different..)
So here's my sysadmin grade speculation (I'm no filesystem guru):
- A script is executed & open
- It is soon deleted and replaced by a new one but cannot be released
just yet
- There is little time between the file becoming unreferenced and the
filesystem becoming read-only
- [wild speculation] Something goes wrong, the LEB it used had to be
orphaned but it's too late, some bad pointer gives LEB 0 the axe
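A minimal sketch of the first two steps in the list above, runnable on
any filesystem. It is only an illustration of the "replaced while still
open" state, not a reproducer for the UBIFS corruption. The temp
directory and file names are made up for the example, and it assumes
rsync's default behaviour of writing a temporary file and renaming it
over the destination:

```shell
#!/bin/sh
# Hypothetical demo paths; nothing here touches UBIFS or nandsim.
tmp=$(mktemp -d)

# The "running script": create it and hold it open on fd 3.
printf 'old contents\n' > "$tmp/script"
exec 3< "$tmp/script"

# rsync-style replacement: write a new file, then rename it over the old one.
# The old inode loses its last name but stays alive while fd 3 is open.
printf 'new contents\n' > "$tmp/script.new"
mv "$tmp/script.new" "$tmp/script"

old=$(cat <&3)            # read through the still-open descriptor
new=$(cat "$tmp/script")  # read through the path

exec 3<&-                 # last reference dropped; old inode can now be freed
rm -rf "$tmp"

echo "still-open descriptor sees: $old"
echo "path now resolves to:       $new"
```

Between the mv and the close of fd 3, the old file is unlinked but still
referenced, which is the window the speculation above is about.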
Now, clearly nothing is executed or left open for a long time in the
nandsim + rsync based test, so I guess the remount with the sync flag
is helping to hit the edge case. Try adding the "--delay-updates"
option to rsync to have the unlinking rushed at the end. Maybe the
slower ARM hardware (cm-x270) I'm using is more prone to this.
I'll try to come up with a sure-fire & simple way to replicate this on
nandsim without the rsync hassle.
Will keep you posted.
Best regards,
-david