[LEDE-DEV] libubox, procd: init process hangs

Felix Fietkau nbd at nbd.name
Tue May 17 04:29:22 PDT 2016


Hi Mats,

On 2016-05-17 12:03, Mats Karrman wrote:
> Hi Felix, others,
> 
> I have been experiencing problems with the init scripts dispatch 
> suddenly stopping (indefinitely).
> This happens maybe once in 100 reboots.
> After inserting a new start script that launches another daemon 
> (cgrulesengd) very early in the boot process, the failures started to 
> come a lot more frequently, maybe once in 10 reboots, making this a real 
> issue.
> I'm normally using the versions of procd and libubox selected by OpenWRT 
> BB branch but I have tested the latest versions from the git repos with 
> the same result.
> So far I have only got this to happen on a quite fast board (ARM dual 
> Cortex-A9 @ 1GHz).
> Inserting trace prints in libubox changes behavior, also suggesting the 
> problem is timing dependent.
> 
> When init hangs:
> - it is still possible to log in on console
> - there is always a zombie start script, e.g. S11sysctl.
> - by killing a process (e.g. ubusd or cgrulesengd) the init process 
> continues.
> - otherwise generating an event, e.g. inserting something into a USB port 
> also makes the init continue.
> 
> I have traced the problem down to the "epoll_wait" call in 
> libubox::uloop.c::uloop_fetch_events().
> The following patch makes sure epoll_wait is never called without a timeout.
> My tests show that this solves the problem.
> I have been able to observe the case when the boot gets stuck and then 
> continues after the 8s timeout.
> However I'm not sure that this is the correct fix for the problem as 
> there may be other reasons that there is no event in the first place.
> Your feedback would be welcome!

I just took a look at the code and uloop's processing of signals looked
a bit racy to me. I've pushed a commit that makes it use signalfd if
available. I also found that waitpid wasn't being retried on signal
interrupt, so I added an extra check there. The changes are in libubox
git, but not in OpenWrt/LEDE yet.
Please test if this fixes your issue.

Thanks,

- Felix


