Environment changes lead to weird boot behaviour

Christian Kapeller christian.kapeller at cmotion.eu
Thu Feb 14 13:29:15 EST 2013


Hi,

I am trying to investigate a situation where barebox (v2013.02.0 + board
patches) fails to boot the Linux kernel on my karo-tx53 based board. The
problem may well have been introduced by myself, but after a few days of
investigation I still fail to grasp its root cause.

Depending on which files are present in the boot environment, the kernel
starts in some cases and refuses to in others.

The file contents seem not to be relevant: I've managed to produce a
broken boot simply by adding an ash script that does a single 'echo blah'.

In all cases barebox shuts down in an orderly fashion and jumps to the
kernel image. The kernel in question is a zImage (3.4) + initramfs +
concatenated devicetree; another zImage + concatenated devicetree is
affected as well.


Background: I am implementing a 'foolproof' field update scheme. The
control flow looks like:

(Good Case)  boot0 -(A)-> bootA/bootB -(B)-> kernel
(Bad Case 1) boot0 -(A)-> bootA/bootB -(C)-> rescue-kernel
(Bad Case 2) boot0 -(D)-> rescue-kernel

boot0   .. 1st stage barebox in 256k NAND partition
bootA/B .. 2nd stage barebox in 256k NAND partition
kernel  .. production kernel + ubiroot in NAND
rescue-kernel  .. selfcontained rescue kernel + initramfs in NAND
bootenv   .. stores just state variables. (256k NAND partition)
scriptenv .. stores just scripts and static config (bundled with 2ndstage)


(A) boot0 checks the two partitions holding a 2nd stage barebox uImage
    and boots the newer one.
(B) 2nd stage barebox starts the production system.
(C) 2nd stage barebox starts the rescue kernel because a button or the
    bootenv says so (see the sketch below).
(D) 1st stage barebox starts the rescue system because no valid 2nd
    stage is found.
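
For illustration, the decision between (B) and (C) in the 2nd stage
boils down to something like this. This is a sketch only: /dev/rescue
and /var/update-in-progress exist in my setup, /dev/kernel is a
placeholder name, and the button check is left out:

# /env/boot/nand (sketch): path (C) if the flag is set, else (B)
if [ -f /var/update-in-progress ]; then
	# an update was interrupted -> boot the rescue system
	bootm /dev/rescue
fi
# normal production boot
bootm /dev/kernel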

I want to be able to exchange the 2nd stage without hassle. To do this,
I've split the boot environment: boot scripts stay with the barebox
image, while non-volatile data is saved in a barebox environment of
its own.

The following patch accomplishes this:

diff --git a/common/startup.c b/common/startup.c
index 14409a2..59e76ac 100644
--- a/common/startup.c
+++ b/common/startup.c
@@ -108,15 +108,17 @@ void start_barebox (void)
        debug("initcalls done\n");
 
 #ifdef CONFIG_ENV_HANDLING
-       if (envfs_load(default_environment_path, "/env", 0)) {
+       envfs_load("/dev/defaultenv", "/env", 0);
 #ifdef CONFIG_DEFAULT_ENVIRONMENT
+       mkdir("/var", 0);
+       if (envfs_load(default_environment_path, "/var", 0)) {
                printf("no valid environment found on %s. "
                        "Using default environment\n",
                        default_environment_path);
-               envfs_load("/dev/defaultenv", "/env", 0);
-#endif
+               envfs_save("/dev/env0", "/var");
        }
 #endif
+#endif
 #ifdef CONFIG_COMMAND_SUPPORT
        printf("running /env/bin/init...\n");
 

Everything looks peachy until I add a file to the boot environment
using the bareboxenv tool. Say I add an 'update-in-progress' flag: if
the 2nd stage loader sees it, it knows that something went wrong and
can act accordingly.
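
From Linux this looks roughly like the following. bareboxenv does have
-l (load) and -s (save) modes, but I'm quoting the argument order from
memory, so check 'bareboxenv -h'; <bootenv-mtd> stands for the bootenv
partition's mtd device:

mkdir /tmp/env
bareboxenv -l <bootenv-mtd> /tmp/env   # unpack the envfs into a directory
touch /tmp/env/update-in-progress      # add the state flag
bareboxenv -s /tmp/env <bootenv-mtd>   # pack the directory back into envfs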

The problem is that, although I can read the state variable out of the
environment just fine, the kernel boot fails with no messages from the
kernel. No earlyprintk output, nothing.


That is where the search started:

Removing the new file with a plain 'rm /var/update-in-progress' made
the kernel boot again ... most of the time.

Removing some scripts (not relevant to this boot path) from the
scriptenv bundled with the image helped ... sometimes.

I removed the 'common/bareboxenv' file before every recompile.

I've investigated size issues: I use defaultenv-2 plus custom scripts,
together ~225k worth of ash scripts, which yields a 15k
common/barebox_default_env. I found no correlation between size
and failure.

I've tried to boil the scripting down to a clean failure case, but
with no success, hence I don't post the scripts in this mail.

I can compile barebox images that render the kernel unbootable, so I
have ruled out issues with writing the environment from Linux.

The rescue kernel is bootable without any additional kernel
parameters, so I should get at least something from there; a plain
'bootm /dev/rescue' works right away.

I've ruled out partition overlaps. The partitions (8 of them)
are registered with mtdparts-add by means of a quite bulky
environment variable.
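
The registration is equivalent to an addpart call along these lines
(the 256k sizes are real; the partition string is abbreviated here
rather than guessed at):

# roughly what mtdparts-add ends up doing:
addpart /dev/nand0 "256k(boot0),256k(bootA),256k(bootB),256k(bootenv),..."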

I've tried to add a big binary blob to the scriptenv, making the
barebox image nearly 256k big. No reproducible failure.

I've tried to add 30 shell scripts, each echoing some line, and source
them from /env/bin/init to see whether ash chokes on them. Again, no
reproducible failure.
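
The test setup was essentially the following (file names invented for
the test; the loop assumes hush's for-loop and globbing are enabled,
otherwise source them one by one):

# /env/bin/test-01 .. /env/bin/test-30 each contain a single line:
echo "test script 01"

# appended to /env/bin/init:
for f in /env/bin/test-*; do
	. $f
done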


So my questions are:

Do you know of any side effects the above patch may introduce?

Do you know of a way to make a kernel fail to boot just by adding an
irrelevant shell script to the boot environment?

What else can I look for?

Best regards,
Christian


-- 
Christian Kapeller
cmotion GmbH
Kriehubergasse 16
1050 Wien / Austria
http://www.cmotion.eu

christian.kapeller at cmotion.eu
Phone: +43 1 789 1096 38


