[LEDE-DEV] libubox, procd: init process hangs
Felix Fietkau
nbd at nbd.name
Tue May 17 04:29:22 PDT 2016
Hi Mats,
On 2016-05-17 12:03, Mats Karrman wrote:
> Hi Felix, others,
>
> I have been experiencing problems with the init scripts dispatch
> suddenly stopping (indefinitely).
> This happens maybe once in 100 reboots.
> After inserting a new start script that launches another daemon
> (cgrulesengd) very early in the boot process, the failures started to
> come a lot more frequently, maybe once in 10 reboots, making this a real
> issue.
> I'm normally using the versions of procd and libubox selected by the
> OpenWrt BB branch, but I have tested the latest versions from the git
> repos with the same result.
> So far I have only got this to happen on a quite fast board (ARM dual
> Cortex-A9 @ 1 GHz).
> Inserting trace prints in libubox changes behavior, also suggesting the
> problem is timing dependent.
>
> When init hangs:
> - it is still possible to log in on console
> - there is always a zombie start script, e.g. S11sysctl.
> - by killing a process (e.g. ubusd or cgrulesengd) the init process
> continues.
> - generating any other event, e.g. inserting something into a USB port,
> also makes the init continue.
>
> I have traced the problem down to the "epoll_wait" call in
> libubox::uloop.c::uloop_fetch_events().
> The following patch makes sure epoll_wait is never called without a timeout.
> My tests show that this solves the problem.
> I have been able to observe the case when the boot gets stuck and then
> continues after the 8s timeout.
> However I'm not sure that this is the correct fix for the problem as
> there may be other reasons that there is no event in the first place.
> Your feedback would be welcome!
I just took a look at the code and uloop's processing of signals looked
a bit racy to me. I've pushed a commit that makes it use signalfd if
available. I also found that waitpid wasn't being retried on signal
interrupt, so I added an extra check there. The changes are in libubox
git, but not in OpenWrt/LEDE yet.
Please test if this fixes your issue.
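
To illustrate the two changes described above, here is a minimal sketch,
not the actual libubox commit. It assumes Linux with signalfd and epoll
available, and the helper names (open_sigchld_fd, waitpid_retry,
wait_and_reap) are hypothetical. SIGCHLD is blocked and delivered through
a file descriptor, so a child that exits just before epoll_wait() still
leaves the descriptor readable, and waitpid() is retried when it is
interrupted by a signal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/epoll.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: block SIGCHLD and return a signalfd for it,
 * or -1 on error. Because the signal is blocked, it stays pending
 * until it is read from the descriptor, so it cannot be lost between
 * a "did a child exit?" check and the call to epoll_wait(). */
static int open_sigchld_fd(void)
{
    sigset_t mask;

    sigemptyset(&mask);
    sigaddset(&mask, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
        return -1;
    return signalfd(-1, &mask, SFD_NONBLOCK | SFD_CLOEXEC);
}

/* Hypothetical helper: waitpid() that retries on EINTR instead of
 * treating an interrupted call as "no child to reap". */
static pid_t waitpid_retry(pid_t pid, int *status, int flags)
{
    pid_t ret;

    do {
        ret = waitpid(pid, status, flags);
    } while (ret < 0 && errno == EINTR);
    return ret;
}

/* Hypothetical helper: wait up to timeout_ms for SIGCHLD on the
 * signalfd (already registered with epfd), then reap every exited
 * child. Returns the number of children reaped, or -1 on error. */
static int wait_and_reap(int epfd, int sfd, int timeout_ms)
{
    struct epoll_event ev;
    struct signalfd_siginfo si;
    int n, status, reaped = 0;

    assert(epfd >= 0 && sfd >= 0);

    do {
        n = epoll_wait(epfd, &ev, 1, timeout_ms);
    } while (n < 0 && errno == EINTR);
    if (n < 0)
        return -1;
    if (n == 0)
        return 0; /* timed out, nothing pending */

    /* Drain the non-blocking descriptor so it stops polling readable. */
    while (read(sfd, &si, sizeof(si)) == (ssize_t)sizeof(si))
        ;

    /* Several exits may be coalesced into one SIGCHLD, so loop. */
    while (waitpid_retry(-1, &status, WNOHANG) > 0)
        reaped++;
    return reaped;
}
```

A caller would create an epoll instance, add the descriptor returned by
open_sigchld_fd() with EPOLLIN, and call wait_and_reap() from its event
loop. The signal has to be blocked before any child is forked for the
descriptor to see every exit.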
Thanks,
- Felix