UBIFS volume corruption (bad node at LEB 0:0)

Mon Jan 19 22:44:48 EST 2009

On 2009-01-19, at 3:56, Artem Bityutskiy wrote:
> Just tried to reproduce this on my x86_64 host without success.

Well, today I had a bit of a breakthrough.

Put the nandsim + rsync loop on the side for now, and let's go back to  
a slightly evolved version of my original scenario.

boot kernel ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs ro init=/root/updater.sh

/root/updater.sh:
#!/bin/sh -x

mount -t proc none /proc
ifconfig eth0 ...
mount -o remount,rw /
rsync -aHx --delete systemA /
rsync -aHx --delete systemB /
sync
exec /bin/sh -c "mount -o remount,ro /; reboot -df;"
-EOF-

No more 'sync' flag, but still an orphan LEB 0:0 on first run, every
time: 100% reproducible. (reboot -d just means don't update
/var/log/wtmp.)

So to break it, it's necessary to remount rw,SYNC in the nandsim
context, but not in the rootfs update scenario (a stripped-down
version of a procedure I've used on jffs2 for two years now).
I've always been puzzled by the lower reproducibility rates in the
nandsim context, because I could replace nandsim with the real flash
and run the same steps (exactly the same hardware, kernel, userland,
rsync, sync data, NAND flash, volume layout and what not). The only
difference? One scenario is a live rootfs, and the other is not...

Which finally led me to the real cause:
It turns out that my test script /root/updater.sh differed between
systemA and systemB, so it was being replaced by rsync *while it was
open and running*.

Indeed, once the script was identical on both rsync targets, UBIFS no  
longer broke. (I believe at the time of my first report, my rsync  
binary itself was different..)

So here's my sysadmin grade speculation (I'm no filesystem guru):
- A script is executed & open
- It is soon deleted and replaced by a new one but cannot be released  
just yet
- There is little time between the file becoming unreferenced and the  
filesystem becoming read-only
- [wild speculation] Something goes wrong, the LEB it used had to be  
orphaned but it's too late, some bad pointer gives LEB 0 the axe
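
To illustrate the first two points only (not the corruption itself), here
is a minimal sketch that runs on any POSIX filesystem. It assumes rsync's
default behaviour of writing a temporary file and renaming it over the
target; the temporary directory and file names are made up for the example
and nothing here touches UBIFS.

```shell
#!/bin/sh
# Sketch: replace a file by rename while a descriptor to it is still open.
# All paths are temporary and hypothetical; nothing here touches UBIFS.
tmpdir=$(mktemp -d)
target="$tmpdir/updater.sh"

echo "version A" > "$target"

# Hold the old inode open, as the running shell does for its script.
exec 3< "$target"

# rsync-style update: write a temporary file, then rename it over the target.
echo "version B" > "$tmpdir/.updater.sh.new"
mv "$tmpdir/.updater.sh.new" "$target"

# The name now resolves to the new inode...
new_content=$(cat "$target")
# ...while the open descriptor still reads the old, now unlinked, inode.
old_content=$(cat <&3)

echo "by name: $new_content"
echo "by fd:   $old_content"

# Closing the descriptor drops the last reference to the old inode.
exec 3<&-
rm -rf "$tmpdir"
```

Between the mv and the final close, the old inode has no name left but is
still in use, which is the state I suspect the filesystem is in when the
remount to read-only arrives.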

Now clearly nothing is executed or left open for a long time in the
nandsim + rsync based test; I guess the remount with the sync flag is
helping hit the edge case. Try adding the "--delay-updates" option to
rsync to have the unlinking rushed at the end. Maybe the slower ARM
hardware (cm-x270) I'm using is more prone to this.

I'll try to come up with a sure-fire & simple way to replicate this on
nandsim without the rsync hassle.

Will keep you posted.

Best regards,
-david