[LEDE-DEV] libubox, procd: init process hangs

Yousong Zhou yszhou4tech at gmail.com
Tue Jun 7 05:49:59 PDT 2016


On 7 June 2016 at 06:11, Xinxing Hu <xinxing.huchn at gmail.com> wrote:
> Hi Guys,
>
> I have another idea about this issue. Maybe it is not kernel, but uloop
> related. I read procd and libubox code a little bit, and it seems there is a
> potential issue existing in uloop_run().
>
> In general, uloop_run() is running in a while loop:
>
> while()
>         1, Process timeouts list
>
>         2, Handle terminated child processes
>
>         3, uloop_run_events(timeout) => calls epoll_wait()
> done
>
> During boot, procd_inittab_run("sysinit") is called in Step1, which calls
> add_initd(). add_initd() would add an entry in timeouts list, whose callback
> function is to execute an rc.d/S* script.
>
> When the while loop goes back to Step1 again, the timeouts list would be
> processed, and an rc.d/S* script would be executed in a child process while
> the parent process remains in the while loop. If everything goes fine, when
> the child process is terminated, the parent process will handle terminated
> child process by calling waitpid() in the while loop. A process callback
> function will also be called, which adds another timeout entry in timeouts
> list. This new entry corresponds to the next rc.d/S* script to be executed.
> When the while loop reaches Step1 again, the next rc.d/S* script would be
> invoked.
>
> Everything looks OK so far. However, due to process scheduling, problems
> can occur when uloop_run_events(uloop_get_next_timeout(&tv)) is called.
> For instance, if the child process is still running when
> uloop_get_next_timeout(&tv) is called, the timeouts list is already
> empty at that point, so uloop_get_next_timeout(&tv) returns -1. If the
> child process then terminates and the signal handler runs before
> epoll_wait() is called, epoll_wait() blocks the parent process
> indefinitely, until some other event it is listening for arrives. In
> this sense, the arrival of other events merely hides the issue. During
> boot, as long as the /etc/rc.d/S* scripts have not all finished,
> epoll_wait() should never block indefinitely.
>
> I think a potential solution might be: during initialization, we let uloop
> listen for a kind of 'dummy' event. Every time the child process
> finishes executing an rc.d/S* script, we send a 'dummy' event. That way,
> epoll_wait() would never block indefinitely during boot.

Interesting.  It looks like the same issue can also affect the
uloop_cancelled check.  Python's tornado library uses a pipe() as a
"waker" to run a given callback on the next I/O loop iteration.

Can you give the attached patch a try and see whether it solves the issue
for you?  So far it has only been run-tested on qemu malta, to make sure
the patched libubox still runs.

                yousong

>
> Best Regards,
> Xinxing
>
>
>
>
> On 2016/5/17 18:03, Mats Karrman wrote:
> Hi Felix, others,
>
> I have been experiencing problems with the init scripts dispatch
> suddenly stopping (indefinitely).
> This happens maybe once in 100 reboots.
> After inserting a new start script that launches another daemon
> (cgrulesengd) very early in the boot process, the failures started to
> come a lot more frequently, maybe once in 10 reboots, making this a real
> issue.
> I'm normally using the versions of procd and libubox selected by OpenWRT
> BB branch but I have tested the latest versions from the git repos with
> the same result.
> So far I have only seen this happen on a fairly fast board (dual-core ARM
> Cortex-A9 @ 1GHz).
> Inserting trace prints in libubox changes behavior, also suggesting the
> problem is timing dependent.
>
> When init hangs:
> - it is still possible to log in on console
> - there is always a zombie start script, e.g. S11sysctl.
> - by killing a process (e.g. ubusd or cgrulesengd) the init process
> continues.
> - generating an event some other way, e.g. inserting something into a USB
> port, also makes the init continue.
>
> I have traced the problem down to the "epoll_wait" call in
> libubox::uloop.c::uloop_fetch_events().
> The following patch makes sure epoll_wait is never called without a timeout.
> My tests show that this solves the problem.
> I have been able to observe the case when the boot gets stuck and then
> continues after the 8s timeout.
> However I'm not sure that this is the correct fix for the problem as
> there may be other reasons that there is no event in the first place.
> Your feedback would be welcome!
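
The patch quoted above is not reproduced in this message. As a rough
sketch of the idea it describes (never calling epoll_wait() without a
timeout), something like the following could be used. The names
clamp_timeout_ms() and bounded_epoll_wait() are hypothetical, and the
8000 ms cap is an assumption based on the 8s figure mentioned in the
quoted text.

```c
#include <sys/epoll.h>

/* Assumed upper bound for a single wait, in milliseconds. */
#define MAX_WAIT_MS 8000

/* Map "no timeout" (-1) and overly long timeouts to the cap. */
static int clamp_timeout_ms(int timeout_ms)
{
	if (timeout_ms < 0 || timeout_ms > MAX_WAIT_MS)
		return MAX_WAIT_MS;
	return timeout_ms;
}

/* epoll_wait() wrapper that can never block longer than MAX_WAIT_MS. */
static int bounded_epoll_wait(int epfd, struct epoll_event *events,
			      int maxevents, int timeout_ms)
{
	return epoll_wait(epfd, events, maxevents,
			  clamp_timeout_ms(timeout_ms));
}
```

A cap like this limits how long a missed wakeup can stall the loop, but
it does not remove the race itself, which matches the observation that
boot got stuck and then continued after the timeout.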
>
> BR // Mats
> Currently working for Inteno Broadband Technology AB
>
> _______________________________________________
> Lede-dev mailing list
> Lede-dev at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/lede-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-uloop-use-a-waker-for-notifying-sigchld-and-loop-can.patch
Type: application/octet-stream
Size: 3299 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/lede-dev/attachments/20160607/83ff4b75/attachment.obj>

